Take-Home Exercise 01: Spatial Point Patterns Analysis of Grab Trajectories in Singapore

Application of Spatial Point Patterns Analysis to discover the geographical distribution of Grab hailing services in Singapore, implemented using spatstat package (Baddeley, Turner, and Rubak 2022) and spNetwork package (Gelb & Apparicio, 2023) in R environment

grab-posisi
spatial point patterns
spatstat
spNetwork
r
Author
Published

January 15, 2024

Modified

February 4, 2024

1.0 Introduction

Human mobility, the spatial-temporal dynamics of human movements, serves as a critical reflection of societal patterns and human behaviors. With the advancement and pervasiveness of Information and Communication Technologies (ICT) in our daily life, especially smart phone, a large volume of data related to human mobility have been collected. These data provide valuable insights into understanding how individuals and populations move within and between different geographical locations. By using appropriate GIS analysis methods, these data can turn into valuable inisghts for predicting future mobility trrends and developing more efficient and sustainable strategies for managing urban mobility.

In this study, I will apply Spatial Point Patterns Analysis methods to discover the geographical and spatio-temporal distribution of Grab hailing services locations in Singapore. In order to determine the geographical and spatio-temporal patterns of the Grab hailing services, I will develop traditional Kernel Density Estimation (KDE) and Temporal Network Kernel Density Estimation (TNKDE). KDE layers will help identify the areas with high concentration of Grab hailing services, providing insights into the demand and popularity of these services in different parts of Singapore. TNKDE, on the other hand, will allow for analysis of how the distribution of Grab hailing services changes over time, revealing temporal patterns and trends in their usage. These spatial and spatio-temporal analyses will contribute to a better understanding of the dynamics and effectiveness of Grab’s mobility services in Singapore.

2.0 Literature Review of Spatial Point Pattern Analysis

Spatial point pattern analysis is concerned with description, statistical characterization, modeling and visulisation of point patterns over space and making inference about the process that could have generated an observed pattern (Boots & Getis, 1988 ,Rey et al., 2023; Pebesma & Bivand, 2023). According to this theory, empirical spatial distribution of points in our daily life are not controlled by sampling, but a result of an underlying geographically-continuous process (Rey et al., 2023). For example, an COVID-19 cluster did not happen by chance, but due to a spatial process of close-contact infection.

When analysing real-world spatial points, it is important to analyse whether the observed spatial points are randomly distributed or patterned due to a process or interaction (Floch et al., 2018). In “complete random” distribution, points are located everywhere with the same probability and independently of each other. On the other hand, the spatial points can be clustered or dispersed due to an underlying point process. However, it is challenging to use heuristic observation and intuitive interpretation to detect whether a spatial point pattern exists (Baddeley et al., 2015; Floch et al., 2018). Hence, spatial point pattern analysis can be used to detect the spatial concentration or dispersion phenomena.

When analysing and interpreting the properties of a point pattern, these properties can be categorized into two: (a) first-order properties and (b) second-order properties (Yuan et al., 2020; Gimond, 2023). First-order properties concern with the characteristics of individual point locations and their variations of their density across space (Gimond, 2023). Under this conception, observations vary from point to point due to changes in the underlying property. Second-order properties focus on not only individual points, but also the interactions between points and their influences on one another (Gimond, 2023). Under this conception, observations vary from place to place due to interaction effects between observations. First-order properties of point patterns are mostly addressed by density-based techniques, such as quadrat analysis and kernel density estimation, whereas, distance-based techniques, such nearest neighbour index and K-functions, are often used to analyse second-order properties since they take into account the distance between point pairs (Yuan et al., 2020; Gimond, 2023).

3.0 Importing Packages

Before we start the exercise, we will need to import necessary R packages first. We will use the following packages:

  • sf : provides a standardised way to encode spatial vector data in R environment, facilitating spatial data operations and analysis.

  • st : creats simple features from numeric vectors, matrices, or lists, enabling the representation and manipulation of spatial structures in R.

  • arrow : for reading and writing Apache Parquet files, a columnar storage file format optimized for use with big data processing frameworks.

  • lubridate : for tackling with temporal data (dates and times), providing tools to parse, manipulate, and do arithmetic with date-time objects.

  • spatstat: A package for statistical analysis of spatial data, specifically Spatial Point Pattern Analysis. This package was provided by Baddeley, Turner and Ruback (2015) and gives a comprehensive list of functions to perform 1st- and 2nd-order spatial point patterns analysis and derive kernel density estimation (KDE) layer.

  • tidyverse : a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structure.

  • raster : reads, writes, manipulates, analyses and model of gridded spatial data (i.e. raster). In this hands-on exercise, it will be used to convert image output generate by spatstat into raster format.

  • tmap: Packages used for creating static and interactive visualisations summary statistics and KDE layers.

  • spatstat : for spatial statistics with a strong focus on analysing spatial point patterns.

  • spNetwork : provides several functions to perform spatial analysis on network, including network kernel density estimation and point pattern analysis.

  • classInt : for choosing univariate class intervals for mapping or other graphics purposes.

  • viridis : for providing viridis color palette designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency.

  • gifski : for converting images to GIF animations, useful for creating dynamic visualizations and presentations.

pacman::p_load(sf,st,arrow,lubridate,tidyverse,raster,tmap,ggplot2, patchwork,spatstat,spNetwork,classInt,viridis,gifski)

4.0 Importing Datasets into R Environment

4.1 Datasets

In this study, we will use Grab-Posisi dataset, which is a comprehensive GPS trajectory dataset for car-hailing services in Southeast Asia. Apart from the time and location of the object, GPS trajectories are also characterised by other parameters such as speed, the headed direction, the area and distance covered during its travel, and the travelled time. Thus, the trajectory patterns from users GPS data are a valuable source of information for a wide range of urban applications, such as solving transportation problems, traffic prediction, and developing reasonable urban planning.

Moreover, we will also use OpenStreetMap dataset, which is an open-sourced geospatial dataset including shapefiles of important layers including road networks, forests, building footprints and many other points of interest.

To extract the Singapore boundary, we will use Master Plan 2019 Subzone Boundary (No Sea), provided by data.gov.sg.

4.2 Importing Grab-Posisi Dataset

Each trajectory in Grab-Posisi dataset is serialised in a file in Apache Parquet format. The Singapore portion of the dataset is packaged into a total of 10 Parquet files.

Firstly, we will use read_parquet function from arrow package, which allows us to read Parquet files into R environment as a data frame (more specifically, a tibble).

df <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00000.snappy.parquet',as_data_frame = TRUE)
df1 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00001.snappy.parquet',as_data_frame = TRUE)
df2 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00002.snappy.parquet',as_data_frame = TRUE)
df3 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00003.snappy.parquet',as_data_frame = TRUE)
df4 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00004.snappy.parquet',as_data_frame = TRUE)
df5 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00005.snappy.parquet',as_data_frame = TRUE)
df6 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00006.snappy.parquet',as_data_frame = TRUE)
df7 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00007.snappy.parquet',as_data_frame = TRUE)
df8 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00008.snappy.parquet',as_data_frame = TRUE)
df9 <- read_parquet('~/IS415-GAA/data/GrabPosisi/part-00009.snappy.parquet',as_data_frame = TRUE)

To consolidate all trajectory instances into a single dataframe, we’ll vertically bind all 10 imported dataframes using bind_rows() function from tidyverse package.

df_trajectories <- bind_rows(df,df1,df2,df3,df4,df5,df6,df7,df8,df9)

To get a quick overview of the dataset, we’ll first check the number of trajectory instances using dim() function. Then, we’ll use head() function to quickly scan through the data columns and values

dim(df_trajectories)
[1] 30329685        9
head(df_trajectories)
# A tibble: 6 × 9
  trj_id driving_mode osname  pingtimestamp rawlat rawlng speed bearing accuracy
  <chr>  <chr>        <chr>           <int>  <dbl>  <dbl> <dbl>   <int>    <dbl>
1 70014  car          android    1554943236   1.34   104.  18.9     248      3.9
2 73573  car          android    1555582623   1.32   104.  17.7      44      4  
3 75567  car          android    1555141026   1.33   104.  14.0      34      3.9
4 1410   car          android    1555731693   1.26   104.  13.0     181      4  
5 4354   car          android    1555584497   1.28   104.  14.8      93      3.9
6 32630  car          android    1555395258   1.30   104.  23.2      73      3.9

From the result above, we can see that the dataset includes a total of 30329685 trajectory instances, each with a total of 9 columns as follows:

Column Name Data Type Remark
trj_id chr Trajectory ID
driving_mode chr Mode of Driving
osname chr Operating System used for Data Recording
pingtimestamp int Data Recording Timestamp
rawlat num Latitude Value (WGS-84)
rawlng num Longitude Value (WGS-84)
speed num Speed
bearing int Bearing
accuracy num Accuracy

From the above table, it is seen that the pingtimestamp is recorded as int. We need to convert this data to proper datetime format to derive meaningful temporal insights of the data. To do so, we will use as_datetime() function from lubridate package.

df_trajectories$pingtimestamp <- as_datetime(df_trajectories$pingtimestamp)

4.3 Importing OpenStreetMap road data for Malaysia, Singapore and Brunei

The gis_osm_roads_free_1 dataset, which we downloaded from OpenStreetMap, is in ESRI shapefile format. To use this data in an R-environment, we need to import it as an sf object. We can do this using the st_read() function of the sf package. This function reads the shapefile data and returns an sf object that can be used for further analysis.

osm_road_sf <- st_read(dsn = "~/IS415-GAA/data/geospatial", 
                layer = "gis_osm_roads_free_1") %>% st_transform(crs = 3414)

4.4 Importing Singapore Master Plan Planning Subzone boundary data

The MP14_SUBZONE_WEB_PL dataset, which we downloaded from data.gov.sg, is in ESRI shapefile format. To use this data in an R-environment, we need to import it as an sf object. We can do this using the st_read() function of the sf package. This function reads the shapefile data and returns an sf object that can be used for further analysis.

mpsz_sf <- st_read(dsn = "~/IS415-GAA/data/geospatial", 
                layer = "MPSZ-2019") %>% st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `/Users/khantminnaing/IS415-GAA/data/geospatial' using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
Reflection

In the code chunk above, we use %>% operator is used to pipe the output of st_read() to the st_transform() function. Since the dataset we are using is the Singapore boundary, we need to assign the standard coordinate reference system for Singapore, which is SVY21 (EPSG:3414). st_transform() function transforms the coordinate reference system of the sf object to 3414.

After importing the dataset, we will plot it to see how it looks. The plot() function is used to plot the geometry of the sf object. The st_geometry() function is used to extract the geometry of the mpsz_sf object.

par(mar = c(0,0,0,0))
plot(st_geometry(mpsz_sf))

5.0 Data Wrangling

Data wrangling is the process of converting and transforming raw data into a usable form and is carried out prior to conducting any data analysis. In this section, we will carry out different data wrangling steps to prepare our datasets ready for analysis.

5.1 Extracting Trip Starting Locations and Temporal Data Values from Grab-Posisi dataset

Firstly, we will extract trip starting locations for all unqiue trajectories in the dataset and store them to a new df named origin_df. We are also interested in obtaining valuable temporal data such as the day of the week, the hour, and the date (yy-mm-dd). To do so, we will use the following functions from lubridate package, and add the newly derived values as columns to origin_df.

  • wday: allows us to get days component of a date-time

  • hour: allows us to get hours component of a date-time

  • mday: allows us to parse dates with year, month, and day components

origin_df <- df_trajectories %>%
  group_by(trj_id) %>%
  arrange(pingtimestamp) %>%
  filter(row_number()==1) %>% 
  mutate(weekday = wday(pingtimestamp,
                       label=TRUE,
                       abbr=TRUE),
         pickup_hr = factor(hour(pingtimestamp)),
         day = factor(mday(pingtimestamp)))

5.2 Extracting Trip Ending Locations and Temporal Data Values from Grab-Posisi dataset

Similar to what we did in previous session, we are also interested to extract trip ending locations and associated temporal data into a new df called destination_df. We will use the same functions from previous session here.

destination_df <- df_trajectories %>%
  group_by(trj_id) %>%
  arrange(desc(pingtimestamp)) %>%
  filter(row_number()==1) %>% 
  mutate(weekday = wday(pingtimestamp,
                       label=TRUE,
                       abbr=TRUE),
         dropoff_hr = factor(hour(pingtimestamp)),
         day = factor(mday(pingtimestamp)))
Reflection

arrange() function sort the timestamps in ascending order by default. Hence, this default order is applied to origin_df. However, for destination_df, the arrange(desc()) argument is used to modify the default order to descending.

5.3 Converting to sf tibble data.frame

origin_df & destination_df that we created earlier are of class “DataFrame”. However, when carrying out point pattern analysis, we need to convert them to sf object to allow for spatial operations. The sf package introduces the sf data frame, which is a tibble (a modern reimagining of the data frame) that has some special characteristics for handling spatial data. Hence, we will convert origin_df & destination_df into sf tibble data.frame using rowlng and rowlat columns as inputs for coordinates. Subsequently, we will assign appropriate coordinate reference system for Singapore, SVY21 (EPSG:3414).

origin_sf <- st_as_sf(origin_df,
                      coords = c("rawlng", "rawlat"),
                      crs = 4326) %>%
  st_transform(crs = 3414)

destination_sf <- st_as_sf(destination_df,
                      coords = c("rawlng", "rawlat"),
                      crs = 4326) %>%
  st_transform(crs = 3414)

5.4 Creating a CoastalLine of Singapore

The original mpsz_sf dataset we imported include information of all URA Master Plan planning area boundaries. However, for this analysis, we only need the national-level boundary of Singapore. Hence, we will need to union all the subzone boundaries to one single polygon boundary. Also, Grab ride-hailing service is only available on the main Singapore islands. Hence, we will need to remove outer islands which Grab service is not available. In particular, we will remove the following planning subzones: NORTH-EASTERN ISLANDS, SOUTHERN GROUP, SUDONG & SEMAKAU.

We can remove these subzones using the subset() function. The subset() function is used to extract rows from a data frame that meet certain conditions.

northeasten.islands <- subset(mpsz_sf, mpsz_sf$SUBZONE_N == "NORTH-EASTERN ISLANDS")
southern.islands <- subset(mpsz_sf, mpsz_sf$SUBZONE_N == "SOUTHERN GROUP")
sudong <- subset(mpsz_sf, mpsz_sf$SUBZONE_N == "SUDONG")
semakau <- subset(mpsz_sf,mpsz_sf$SUBZONE_N == "SEMAKAU")

outerislands <- dplyr::bind_rows(list(northeasten.islands,southern.islands,sudong,semakau))
Reflection

In the code chunk above, we first created four new data frames called northeasten.islands, southern.islands, sudong, and semakau by selecting rows from mpsz_sf where the value in the SUBZONE_N column matches the corresponding value.

After that, we used bind_rows() function from the dplyr package to combine these four data frames into a single data frame called outerislands.

After importing the dataset, we will plot it to see how it looks.

par(mar = c(0,0,0,0))
plot(st_geometry(outerislands))

As mentioned earlier, we only need to get national-level boundary of Singapore, without outer islands. To do so, we will need to process the mpsz_sf layer to achieve the outcome. - We will first use st_union() function from the sf package to combine the geometries of mpsz_sf and outerislands sf objects into a single geometry each. - Next, we will use st_difference() function then removes the overlapping areas between the two geometries. - Finally, we will store the non-overlapping areas into a new sf objected called sg_sf.

sg_sf <- st_difference(st_union(mpsz_sf), st_union(outerislands))

To assess whether the geometry of the newly created sg_sf matches our intended outcome, we will plot it out.

par(mar = c(0,0,0,0))
plot(st_geometry(sg_sf))

5.5 Extracting Road Layers within Singapore

As we have seen in Section 4.3., osm_road_sf dataset includes road networks from not only Singapore, but also Malaysia and Brunei. However, our analysis is focused on Singapore. Hence, we will need to remove unecessary data rows. To do so, we will

sg_road_sf <- st_intersection(osm_road_sf,lsg_sf)

Next, we will look at the classification of road networks as provided by OpenStreetMap.

unique(sg_road_sf$fclass)
 [1] "primary"        "residential"    "tertiary"       "footway"       
 [5] "service"        "secondary"      "motorway"       "motorway_link" 
 [9] "trunk"          "trunk_link"     "primary_link"   "pedestrian"    
[13] "living_street"  "unclassified"   "steps"          "track_grade2"  
[17] "track"          "secondary_link" "cycleway"       "path"          
[21] "tertiary_link"  "track_grade1"   "track_grade3"   "unknown"       
[25] "track_grade5"   "bridleway"      "track_grade4"  

Looking at the road classification, it is observed that not all categories are relevant to our analysis, which is primarily concerned with driving networks where taxis can facilitate pick-ups or drop-offs. Hence, we will implement a filtering process on the dataset to exclude road segments that fall outside the scope of our analysis.

Firstly we will specify the road classes that we want to retain.

driving_classes <- c("primary", "primary_link", "residential", "secondary", "secondary_link", "service", "tertiary", "tertiary_link") 

Next, we will filter sg_road_sf object to remove all the rows that does not have our desired f_class attribute value.

sg_driving_sf <- sg_road_sf %>%
  filter(fclass %in% driving_classes)
unique(sg_driving_sf$fclass)
[1] "primary"        "residential"    "tertiary"       "service"       
[5] "secondary"      "primary_link"   "secondary_link" "tertiary_link" 
Reflection

Why motorway and motorway_link classes are not included?

According to OpenStreetMap, fclass motorway refers to expressway. In Singapore, stopping or parking a vehicle on an expressway is illegal under the Road Traffic Act. Hence, motorway (and motorway_link) are not relevant for network constraint kernel density estimation (NKDE) analysis that we will carry out later.

Now that we have filterd out the dataset, we will now plot to see the driving road network of Singapore using tmap.

tmap_mode("plot")

tm_shape(sg_sf) +
  tm_polygons() +
tm_shape(sg_driving_sf) +
  tm_lines(col="fclass", palette ="viridis") +
  tm_layout(main.title = "Road Network in Singapore",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.outside = TRUE,
            frame = TRUE) +
  tm_borders(alpha = 0.5)

5.5 Converting the Simple Features to Planar Point Pattern Object

In order to use the capabilities of spatstat packahe, a spatial dataset should be converted into an object of class planar point pattern ppp (Baddeley et al., 2015). A point pattern object contains the spatial coordinates of the points, the marks attached to the points (if any), the window in which the points were observed, and the name of the unit of length for the spatial coordinates. s. Thus, a single object of class ppp contains all the information required to perform spatial point pattern analysis.

In previous section, we have created sf objects of Grab trajectory origin and destination points. Now, we will convert them into ppp objects using as.ppp() function from spatstat package.

origin_ppp <- as.ppp(st_coordinates(origin_sf), st_bbox(origin_sf))

par(mar = c(0,0,1,0))
plot(origin_ppp)

Reflection

The code chunk above converts the origin_sf object to a point pattern object of class ppp. st_coordinates() function is used to extract the coordinates of the origin_sf object and st_bbox() function is used to extract the bounding box of the origin_sf object. The resulting object origin_ppp is a point pattern object of class ppp.

destination_ppp <- as.ppp(st_coordinates(destination_sf), st_bbox(destination_sf))

par(mar = c(0,0,1,0))
plot(destination_ppp)

5.6 Handling Data Errors

Before going striaght into analysis, we will need to a quick look at the summary statistics of the newly created ppp objects. This is an important step to ensure that the data is free of errors and that a reliable analysis can be performed.

5.6.1 Data Error Handling for origin_ppp

We will use summary() function to get summary information of origin_ppp object.

summary(origin_ppp)
Planar point pattern:  28000 points
Average intensity 2.473666e-05 points per square unit

Coordinates are given to 3 decimal places
i.e. rounded to the nearest multiple of 0.001 units

Window: rectangle = [3628.24, 49845.23] x [25198.14, 49689.64] units
                    (46220 x 24490 units)
Window area = 1131920000 square units

We can also check if there is any duplicated points in origin_ppp object using any(duplicated() function.

any(duplicated(origin_ppp))
[1] FALSE

The code output is FALSE, which means there are no duplication of point coordaintes in the origin_ppp object.

Reflection

Why do we need to check duplication?

When analyzing spatial point processes, it is important to avoid duplication of points. This is because statistical methodology for spatial point processes is based largely on the assumption that processes are simple, i.e., that points of the process can never be coincident. When the data have coincident points, some statistical procedures designed for simple point processes will be severely affected (Baddeley et al., 2015).

5.6.2 Data Error Handling for destination_ppp

We will use summary() function to get summary information of destination_ppp object.

summary(destination_ppp)
Planar point pattern:  28000 points
Average intensity 2.493661e-05 points per square unit

Coordinates are given to 3 decimal places
i.e. rounded to the nearest multiple of 0.001 units

Window: rectangle = [3637.21, 49870.63] x [25221.3, 49507.79] units
                    (46230 x 24290 units)
Window area = 1122850000 square units

We can also check if there is any duplicated points in destination_ppp object using any(duplicated() function.

any(duplicated(destination_ppp))
[1] FALSE

The code output is FALSE, which means there are no duplication of point coordinates in the destination_ppp object.

5.7 Creating Observation Windows

Many data types in spatstat require us to specify the region of space inside which the data were observed. This is the observation window and it is represented by an object of class owin. In this analysis, our study area is Singapore, hence we will use Singapore boundary as the observation window for spatial point pattern analysis.

In Section 5.4, we have already created the sg_sf object, which represents the Singapore boundary (without outer islands). To convert this sf object to owin object, we will use as.owin() function from spatstat package.

sg_owin <- as.owin(sg_sf)
plot.owin(sg_owin)

We will use summary() function to get summary information of sg_owin object.

summary(sg_owin)
Window: polygonal boundary
56 separate polygons (36 holes)
                  vertices         area relative.area
polygon 1            15307  7.00834e+08      9.92e-01
polygon 2              285  1.61128e+06      2.28e-03
polygon 3               27  1.50315e+04      2.13e-05
polygon 4 (hole)        41 -4.01660e+04     -5.69e-05
polygon 5 (hole)       317 -5.11280e+04     -7.24e-05
polygon 6 (hole)         3 -4.14099e-04     -5.86e-13
polygon 7               30  2.80002e+04      3.97e-05
polygon 8 (hole)         4 -2.86396e-01     -4.06e-10
polygon 9 (hole)         3 -1.81439e-04     -2.57e-13
polygon 10 (hole)        3 -8.68789e-04     -1.23e-12
polygon 11 (hole)        3 -5.99535e-04     -8.49e-13
polygon 12 (hole)        3 -3.04561e-04     -4.31e-13
polygon 13 (hole)        3 -4.46076e-04     -6.32e-13
polygon 14 (hole)        3 -3.39794e-04     -4.81e-13
polygon 15 (hole)        3 -4.52043e-05     -6.40e-14
polygon 16 (hole)        3 -3.90173e-05     -5.53e-14
polygon 17 (hole)        3 -9.59850e-05     -1.36e-13
polygon 18 (hole)        4 -2.54488e-04     -3.60e-13
polygon 19 (hole)        4 -4.28453e-01     -6.07e-10
polygon 20 (hole)        4 -2.18616e-04     -3.10e-13
polygon 21 (hole)        5 -2.44411e-04     -3.46e-13
polygon 22 (hole)        5 -3.64686e-02     -5.16e-11
polygon 23              71  8.18750e+03      1.16e-05
polygon 24 (hole)        6 -8.37554e-01     -1.19e-09
polygon 25 (hole)       38 -7.79904e+03     -1.10e-05
polygon 26 (hole)        3 -3.41897e-05     -4.84e-14
polygon 27 (hole)        3 -3.65499e-03     -5.18e-12
polygon 28 (hole)        3 -4.95057e-02     -7.01e-11
polygon 29              91  1.49663e+04      2.12e-05
polygon 30 (hole)        3 -3.79135e-02     -5.37e-11
polygon 31 (hole)        5 -2.92235e-04     -4.14e-13
polygon 32 (hole)        3 -7.43616e-06     -1.05e-14
polygon 33 (hole)      270 -1.21455e+03     -1.72e-06
polygon 34 (hole)       19 -4.39650e+00     -6.23e-09
polygon 35 (hole)       35 -1.38385e+02     -1.96e-07
polygon 36 (hole)       23 -1.99656e+01     -2.83e-08
polygon 37              40  1.38607e+04      1.96e-05
polygon 38 (hole)       41 -6.00381e+03     -8.50e-06
polygon 39 (hole)        7 -1.40546e-01     -1.99e-10
polygon 40 (hole)       11 -8.36705e+01     -1.18e-07
polygon 41 (hole)        3 -2.33435e-03     -3.31e-12
polygon 42              45  2.51218e+03      3.56e-06
polygon 43             139  3.22293e+03      4.56e-06
polygon 44             148  3.10395e+03      4.40e-06
polygon 45 (hole)        4 -1.72650e-04     -2.44e-13
polygon 46              75  1.73526e+04      2.46e-05
polygon 47              83  5.28920e+03      7.49e-06
polygon 48             106  3.04104e+03      4.31e-06
polygon 49             264  1.50631e+06      2.13e-03
polygon 50              71  5.63061e+03      7.97e-06
polygon 51              10  1.99717e+02      2.83e-07
polygon 52 (hole)        3 -1.37223e-02     -1.94e-11
polygon 53             487  2.06117e+06      2.92e-03
polygon 54              65  8.42861e+04      1.19e-04
polygon 55              47  3.82087e+04      5.41e-05
polygon 56              22  6.74651e+03      9.55e-06
enclosing rectangle: [2667.54, 55941.94] x [21448.47, 50256.33] units
                     (53270 x 28810 units)
Window area = 706156000 square units
Fraction of frame area: 0.46

5.8 Combining ppp objects and owin object

In section 5.5, we have created two ppp objects - origin_ppp and destination_ppp, each representing the spatial points of Grab trajectory origin and destination. In section 5.7, we have created a owin object called sg_owin, which represent the observation window of our analysis.

The observation window sg_owin and the point pattern origin_ppp or destination_ppp can be combined, so that the custom window replaces the default ractangular extent (as seen in section 5.5).

origin_ppp_sg = origin_ppp[sg_owin]
destination_ppp_sg = destination_ppp[sg_owin]

par(mar = c(0,0,1,0))
plot(origin_ppp_sg)

plot(destination_ppp_sg)

We will use summary() function to get summary information of the newly created origin_ppp_sg object and destination_ppp_sg object.

summary(origin_ppp_sg)
Planar point pattern:  28000 points
Average intensity 3.965129e-05 points per square unit

Coordinates are given to 3 decimal places
i.e. rounded to the nearest multiple of 0.001 units

Window: polygonal boundary
56 separate polygons (36 holes)
                  vertices         area relative.area
polygon 1            15307  7.00834e+08      9.92e-01
polygon 2              285  1.61128e+06      2.28e-03
polygon 3               27  1.50315e+04      2.13e-05
polygon 4 (hole)        41 -4.01660e+04     -5.69e-05
polygon 5 (hole)       317 -5.11280e+04     -7.24e-05
polygon 6 (hole)         3 -4.14099e-04     -5.86e-13
polygon 7               30  2.80002e+04      3.97e-05
polygon 8 (hole)         4 -2.86396e-01     -4.06e-10
polygon 9 (hole)         3 -1.81439e-04     -2.57e-13
polygon 10 (hole)        3 -8.68789e-04     -1.23e-12
polygon 11 (hole)        3 -5.99535e-04     -8.49e-13
polygon 12 (hole)        3 -3.04561e-04     -4.31e-13
polygon 13 (hole)        3 -4.46076e-04     -6.32e-13
polygon 14 (hole)        3 -3.39794e-04     -4.81e-13
polygon 15 (hole)        3 -4.52043e-05     -6.40e-14
polygon 16 (hole)        3 -3.90173e-05     -5.53e-14
polygon 17 (hole)        3 -9.59850e-05     -1.36e-13
polygon 18 (hole)        4 -2.54488e-04     -3.60e-13
polygon 19 (hole)        4 -4.28453e-01     -6.07e-10
polygon 20 (hole)        4 -2.18616e-04     -3.10e-13
polygon 21 (hole)        5 -2.44411e-04     -3.46e-13
polygon 22 (hole)        5 -3.64686e-02     -5.16e-11
polygon 23              71  8.18750e+03      1.16e-05
polygon 24 (hole)        6 -8.37554e-01     -1.19e-09
polygon 25 (hole)       38 -7.79904e+03     -1.10e-05
polygon 26 (hole)        3 -3.41897e-05     -4.84e-14
polygon 27 (hole)        3 -3.65499e-03     -5.18e-12
polygon 28 (hole)        3 -4.95057e-02     -7.01e-11
polygon 29              91  1.49663e+04      2.12e-05
polygon 30 (hole)        3 -3.79135e-02     -5.37e-11
polygon 31 (hole)        5 -2.92235e-04     -4.14e-13
polygon 32 (hole)        3 -7.43616e-06     -1.05e-14
polygon 33 (hole)      270 -1.21455e+03     -1.72e-06
polygon 34 (hole)       19 -4.39650e+00     -6.23e-09
polygon 35 (hole)       35 -1.38385e+02     -1.96e-07
polygon 36 (hole)       23 -1.99656e+01     -2.83e-08
polygon 37              40  1.38607e+04      1.96e-05
polygon 38 (hole)       41 -6.00381e+03     -8.50e-06
polygon 39 (hole)        7 -1.40546e-01     -1.99e-10
polygon 40 (hole)       11 -8.36705e+01     -1.18e-07
polygon 41 (hole)        3 -2.33435e-03     -3.31e-12
polygon 42              45  2.51218e+03      3.56e-06
polygon 43             139  3.22293e+03      4.56e-06
polygon 44             148  3.10395e+03      4.40e-06
polygon 45 (hole)        4 -1.72650e-04     -2.44e-13
polygon 46              75  1.73526e+04      2.46e-05
polygon 47              83  5.28920e+03      7.49e-06
polygon 48             106  3.04104e+03      4.31e-06
polygon 49             264  1.50631e+06      2.13e-03
polygon 50              71  5.63061e+03      7.97e-06
polygon 51              10  1.99717e+02      2.83e-07
polygon 52 (hole)        3 -1.37223e-02     -1.94e-11
polygon 53             487  2.06117e+06      2.92e-03
polygon 54              65  8.42861e+04      1.19e-04
polygon 55              47  3.82087e+04      5.41e-05
polygon 56              22  6.74651e+03      9.55e-06
enclosing rectangle: [2667.54, 55941.94] x [21448.47, 50256.33] units
                     (53270 x 28810 units)
Window area = 706156000 square units
Fraction of frame area: 0.46
summary(destination_ppp_sg)
Planar point pattern:  27997 points
Average intensity 3.964704e-05 points per square unit

Coordinates are given to 3 decimal places
i.e. rounded to the nearest multiple of 0.001 units

Window: polygonal boundary
56 separate polygons (36 holes)
                  vertices         area relative.area
polygon 1            15307  7.00834e+08      9.92e-01
polygon 2              285  1.61128e+06      2.28e-03
polygon 3               27  1.50315e+04      2.13e-05
polygon 4 (hole)        41 -4.01660e+04     -5.69e-05
polygon 5 (hole)       317 -5.11280e+04     -7.24e-05
polygon 6 (hole)         3 -4.14099e-04     -5.86e-13
polygon 7               30  2.80002e+04      3.97e-05
polygon 8 (hole)         4 -2.86396e-01     -4.06e-10
polygon 9 (hole)         3 -1.81439e-04     -2.57e-13
polygon 10 (hole)        3 -8.68789e-04     -1.23e-12
polygon 11 (hole)        3 -5.99535e-04     -8.49e-13
polygon 12 (hole)        3 -3.04561e-04     -4.31e-13
polygon 13 (hole)        3 -4.46076e-04     -6.32e-13
polygon 14 (hole)        3 -3.39794e-04     -4.81e-13
polygon 15 (hole)        3 -4.52043e-05     -6.40e-14
polygon 16 (hole)        3 -3.90173e-05     -5.53e-14
polygon 17 (hole)        3 -9.59850e-05     -1.36e-13
polygon 18 (hole)        4 -2.54488e-04     -3.60e-13
polygon 19 (hole)        4 -4.28453e-01     -6.07e-10
polygon 20 (hole)        4 -2.18616e-04     -3.10e-13
polygon 21 (hole)        5 -2.44411e-04     -3.46e-13
polygon 22 (hole)        5 -3.64686e-02     -5.16e-11
polygon 23              71  8.18750e+03      1.16e-05
polygon 24 (hole)        6 -8.37554e-01     -1.19e-09
polygon 25 (hole)       38 -7.79904e+03     -1.10e-05
polygon 26 (hole)        3 -3.41897e-05     -4.84e-14
polygon 27 (hole)        3 -3.65499e-03     -5.18e-12
polygon 28 (hole)        3 -4.95057e-02     -7.01e-11
polygon 29              91  1.49663e+04      2.12e-05
polygon 30 (hole)        3 -3.79135e-02     -5.37e-11
polygon 31 (hole)        5 -2.92235e-04     -4.14e-13
polygon 32 (hole)        3 -7.43616e-06     -1.05e-14
polygon 33 (hole)      270 -1.21455e+03     -1.72e-06
polygon 34 (hole)       19 -4.39650e+00     -6.23e-09
polygon 35 (hole)       35 -1.38385e+02     -1.96e-07
polygon 36 (hole)       23 -1.99656e+01     -2.83e-08
polygon 37              40  1.38607e+04      1.96e-05
polygon 38 (hole)       41 -6.00381e+03     -8.50e-06
polygon 39 (hole)        7 -1.40546e-01     -1.99e-10
polygon 40 (hole)       11 -8.36705e+01     -1.18e-07
polygon 41 (hole)        3 -2.33435e-03     -3.31e-12
polygon 42              45  2.51218e+03      3.56e-06
polygon 43             139  3.22293e+03      4.56e-06
polygon 44             148  3.10395e+03      4.40e-06
polygon 45 (hole)        4 -1.72650e-04     -2.44e-13
polygon 46              75  1.73526e+04      2.46e-05
polygon 47              83  5.28920e+03      7.49e-06
polygon 48             106  3.04104e+03      4.31e-06
polygon 49             264  1.50631e+06      2.13e-03
polygon 50              71  5.63061e+03      7.97e-06
polygon 51              10  1.99717e+02      2.83e-07
polygon 52 (hole)        3 -1.37223e-02     -1.94e-11
polygon 53             487  2.06117e+06      2.92e-03
polygon 54              65  8.42861e+04      1.19e-04
polygon 55              47  3.82087e+04      5.41e-05
polygon 56              22  6.74651e+03      9.55e-06
enclosing rectangle: [2667.54, 55941.94] x [21448.47, 50256.33] units
                     (53270 x 28810 units)
Window area = 706156000 square units
Fraction of frame area: 0.46

6.0 Exploratory Spatial Data Analysis

6.1 Measuring Central Tendency

Descriptive statistics are used in point pattern analysis to summarise a point pattern’s basic properties, such as its central tendency and dispersion. The mean centre and the median centre are two often employed metrics for central tendency (Gimond, 2019).

6.1.1 Mean Center

Mean center is the arithmetic average of the (x, y) coordinates of all point in the study area. Similar to mean in statistical analysis, mean center is influenced to a greater degree by the outliers. (Yuan et al.,2020)

origin_xy <- st_coordinates(origin_sf)
origin_mc <- apply(origin_xy, 2, mean)

destination_xy <- st_coordinates(destination_sf)
destination_mc <- apply(destination_xy, 2, mean)

origin_mc
       X        Y 
28490.57 36939.04 
destination_mc
       X        Y 
28870.96 36590.49 

The results show that the origin and destination mean centres are, respectively, (28490.57, 36939.04) and (28870.96, 36590.49). The two mean centres appear to be situated in close proximity to one another.

6.1.2 Median Center

Median center is the location that minimizes the sum of distances required to travel to all points within an observation window. It can be calculated using an iterative procedure first presented by Kulin and Kuenne (1962). The procedure begins at a predetermined point, such as the median center, as the initial point. Then, the algorithm updates the median center’s new coordinates (x’, y’) continually until the optimal value is reached. The median center, as opposed to the mean center, offers a more reliable indicator of central tendency as it is unaffected by outliers (Yuan et al., 2020).

origin_medc <- apply(origin_xy, 2, median)

destination_medc <- apply(destination_xy, 2, median)

origin_medc
       X        Y 
28553.17 36179.05 
destination_medc
       X        Y 
28855.04 35883.86 

Based on the results, the median centres of origin and destination are, respectively, (28553.17, 36179.05) and (28855.04, 35883.86). The two median centres appear to be situated in close proximity to one another.

Moreover, mean centers and median centers for each origin and destination points are similar. This may imply that the distribution of the data is relatively balanced and there is not a significant difference in the spatial patterns between the origin and destination points. Additionally, this indicates that both the mean center and median center are effective measures for analyzing the central tendency of the data in this context.

6.1.3 Plotting Mean and Median Centers

We can try to plot both results obtained from previous section on the same plane for better comparison of the mean center and median center.

par(mar = c(0,0,1,0))

plot(sg_sf, col='light grey', main="mean and median centers of origin_sf")
points(origin_xy, cex=.5)
points(cbind(origin_mc[1], origin_mc[2]), pch='*', col='red', cex=3)
points(cbind(origin_medc[1], origin_medc[2]), pch='*', col='purple', cex=3)

par(mar = c(0,0,1,0))

plot(sg_sf, col='light grey', main="mean and median centers of destination_sf")
points(destination_xy, cex=.5)
points(cbind(destination_mc[1], destination_mc[2]), pch='*', col='yellow', cex=3)
points(cbind(destination_medc[1], destination_medc[2]), pch='*', col='orange', cex=3)

6.2 Measuring Dispersion

6.2.1 Standard Distance

Standard distances are defined similarly to standard deviations. This indicator measures how dispersed a group of points is around its mean center (Gimond, 2023).

origin_sd <- sqrt(sum((origin_xy[,1] - origin_mc[1])^2 + (origin_xy[,2] - origin_mc[2])^2) / nrow(origin_xy))

destination_sd <- sqrt(sum((destination_xy[,1] - destination_mc[1])^2 + (destination_xy[,2] - destination_mc[2])^2) / nrow(destination_xy))

origin_sd
[1] 10187.88
destination_sd
[1] 9545.69

From the results, the origin and destination standard distances are 10187.88 and 9545.69, respectively. Hence, it appears that origin points are more dispersed than the origin points.

Reflection

However, it would be challenging to discern why the origin points are more dispersed without further analysis. Further analysis would be needed to determine the factors contributing to the increased dispersion of destination points. Since it is out of scope for this exercise, we will not explore any further.

6.2.2 Plotting Standard Distances

In this section, we will create bearing circles of origin and destination points using the standard distance values we have calculated earlier. This can provide visual representation of their dispersion and make intuitive comparison between them.

par(mar = c(0,0,1,0))
plot(sg_sf, col='light grey', main="standard distance of origin_sf")
points(origin_xy, cex=.5)
points(cbind(origin_mc[1], origin_mc[2]), pch='*', col='red', cex=3)

bearing <- 1:360 * pi/180
cx <- origin_mc[1] + origin_sd * cos(bearing)
cy <- origin_mc[2] + origin_sd * sin(bearing)
circle <- cbind(cx, cy)
lines(circle, col='red', lwd=2)

par(mar = c(0,0,1,0))
plot(sg_sf, col='light grey',main="standard distance of destination_sf")
points(destination_xy, cex=.5)
points(cbind(destination_mc[1], destination_mc[2]), pch='*', col='purple', cex=3)

bearing <- 1:360 * pi/180
cx <- destination_mc[1] + destination_sd * cos(bearing)
cy <- destination_mc[2] + destination_sd * sin(bearing)
circle <- cbind(cx, cy)
lines(circle, col='purple', lwd=2)

A better comparison of the standard distances between origin and destination points can also be achieved by trying to plot both results on the same plane.

par(mar = c(0,0,1,0))

plot(sg_sf, col='light grey',main="standard distances of origin_sf & destination_sf")
points(cbind(origin_mc[1], origin_mc[2]), pch='*', col='red', cex=3)
points(cbind(destination_mc[1], destination_mc[2]), pch='*', col='purple', cex=3)

bearing <- 1:360 * pi/180

origin_cx <- origin_mc[1] + origin_sd * cos(bearing)
origin_cy <- origin_mc[2] + origin_sd * sin(bearing)

destination_cx <- destination_mc[1] + destination_sd * cos(bearing)
destination_cy <- destination_mc[2] + destination_sd * sin(bearing)

origin_circle <- cbind(origin_cx, origin_cy)
destination_circle <- cbind(destination_cx, destination_cy)

lines(origin_circle, col='red', lwd=2)
lines(destination_circle, col='purple', lwd=2)

6.3 Spatial Randomness Test

Clark and Evans (1954) give a very simple test of spatial randomness called Clark and Evans aggregation index (R). It is the ratio of the observed mean nearest neighbour distance in the pattern to that expected for a Poisson point process of the same intensity. R-value >1 suggests ordering, while R-value <1 suggests clustering.

We will perform the Clark-Evans test of aggregation for a spatial point pattern by using clarkevans.test() of statspat.

6.3.1 Spatial Randomness Test for Origin Points

The test hypotheses are:

  • H0 = The distribution of trajectory original points are randomly distributed.

  • H1= The distribution of trajectory original points are not randomly distributed.

The 95% confidence interval will be used.

clarkevans.test(origin_ppp_sg,
                correction="none",
                clipregion="sg_owin",
                alternative=c("clustered"),
                nsim=99)

    Clark-Evans test
    No edge correction
    Z-test

data:  origin_ppp_sg
R = 0.27408, p-value < 2.2e-16
alternative hypothesis: clustered (R < 1)

The Clark-Evans test for the origin points shows an R-value of 0.27408, which is less than 1. This indicates a clustered distribution. The p-value is less than 2.2e-16, which is extremely small and less than the significance level of 0.05. This means that we will reject the null hypothesis (H~0) and accept the alternative hypothesis (H~1). Therefore, the statistical inference from this test is that the original points are not randomly distributed but are clustered. This suggests that there may be underlying factors influencing the spatial distribution of these points.

6.3.2 Spatial Randomness Test for Destination Points

The test hypotheses are:

  • H0 = The distribution of trajectory destination points are randomly distributed.

  • H1= The distribution of trajectory destination points are not randomly distributed.

The 95% confidence interval will be used.

clarkevans.test(destination_ppp_sg,
                correction="none",
                clipregion="sg_owin",
                alternative=c("clustered"),
                nsim=99)

    Clark-Evans test
    No edge correction
    Z-test

data:  destination_ppp_sg
R = 0.29484, p-value < 2.2e-16
alternative hypothesis: clustered (R < 1)

The Clark-Evans test for the destination points also shows an R-value of 0.29484, which is less than 1. This indicates a clustered distribution. The p-value is less than 2.2e-16, which is significantly smaller than the significance level of 0.05. Therefore, we reject the null hypothesis (H~0) and accept the alternative hypothesis (H~1). Therefore, the statistical inference from this test is that the destination points are not randomly distributed but are clustered. This suggests that there may be underlying factors influencing the spatial distribution of these points.

7.0 First-Order Spatial Point Patterns Analysis

After data wrangling is complete, we will start to perform first-order spatial point pattern analysis using functions from spatstat package. As we have discussed in Section 2.0., First-order properties concern the characteristics of individual point locations and their variations of their density across space and are mostly addressed by density-based techniques, such as quadrant analysis and kernel density estimation.

Investigation of the intensity of a point pattern is one of the first and most important steps in point pattern analysis (Baddeley et al., 2015). If the point process has an intensity function λ(u), this function can be estimated non-parametrically by kernel estimation (Baddeley et al., 2015). Kernel estimation allows for smoothing of the probability density estimation of a random variable (in this analysis a point event) based on kernels as weights.

7.1 Rescaling origin_ppp_sg and destination_ppp_sg

The SVY21 Coordinate References System uses meters as the standard unit. Hence, the original_ppp_sg and destination_ppp_sg that we have prepared in the previous sections has “metres” as the unit. However, we will need to convert the measuring unit from metre to kilometeres when calculating the kernel density estimators for entirety of Singapore because kilometers provide a more appropriate scale for analyzing large areas.

origin_ppp_sg.km <- rescale(origin_ppp_sg, 1000, "km")
destination_ppp_sg.km <- rescale(destination_ppp_sg, 1000, "km")

7.2 Computing Default Kernel Density Estimation

Kernel Destiny Estimation (KDE) generates a surface (raster) representing the estimated distribution of point events over the observation window. Each cell in the KDE layer carries a value representing the estimated density of that location (Wilkin, 2020). Hence, this approach is also known as local density approach. To build the KDE layer, a localised density is calculated for multiple small subsets of the observation window. However, these subsets overlap throughout each iteration, resulting in a moving window defined by a kernel (Wilkin, 2020; Gimond, 2023).

In this section, we will focus on destination points as we would like to identify areas that exert a ‘pull’ effect on people, hence resulting in cluster of trajectory destinations. Analyzing the destination of Grab trajectories can provide interesting insights into pull factors within a given area and help identify popular destinations or areas of high mobility demand.

Kernel estimation is implemented in spatstat by the function density.ppp(), a method for the generic command density.

par(mar = c(0,1,1,1))
kde_default_destination <- density(destination_ppp_sg.km)
plot(kde_default_destination,main = "Default Density KDE for Destination Points")
contour(kde_default_destination, add=TRUE)

sigma argument in density() function controls the bandwidth of kernel function. The choice of the bandwidth affects the kernel density estimation strongly. A smaller bandwidth will produce a finer density estimate with all little peaks and valleys. A larger bandwidth will result into a smoother distribution of point densities. Generally speaking, if the bandwidth is too small the estimate is too noisy, while if bandwidth is too high the estimate may miss crucial elements of the point pattern due to over-smoothing (Scott, 2009).

Density Estimates with Different Smoothing Bandwidth (Ref, Baddeley et al., 2015)

When sigma value is not specified, an isotropic Gaussian kernel will be used, with a default value of sigma calculated by a simple rule of thumb that depends only on the size of the window. Hence, the KDE given by default argument may not be what we aim to achieve. Looking at the KDE plot we have created above, there are signs of oversmoothing where only a single spatial cluster in the CBD area being observable. This can severely limit our analysis as potential small-scale clusters and other interesting details are being masked by the oversmoothing effect.

To overcome this challenge, we can specify smoothing bandwidth through the argument sigma or kernel function through the argument kernel to compute and plot more intuitive and detailed KDE maps.

7.3 Creating KDE Layers with Fixed Bandwidth

7.3.1 Computing Fixed Bandwidths Using Different Bandwidth Selection Methods

density() function of spatstat allows us to compute a kernel density for a given set of point events.

  • bw.diggle() Cross Validated Bandwidth Selection for Kernel Density: In this method, band-width σ is chosen to minimize the mean-square error criterion defined by Diggle (1985). The mean-square error is a measure of the average of the squares of the errors - that is, the average squared difference between the estimated values and the actual value.

  • bw.CvL() Cronie and van Lieshout’s Criterion for Bandwidth Selection for Kernel Density: The bandwidth σ is chosen to minimize the discrepancy between the area of the observation window and the sum of reciprocal estimated intensity values at the points of the point process. This method aims to choose a bandwidth that best represents the underlying point process, taking into account both the observed points and the area they occupy.

  • bw.scott() Scott’s Rule for Bandwidth Selection for Kernel Density: The bandwidth σ is computed by the rule of thumb of Scott (1992). The bandwidth is proportional to \(n^{-1/(d-4)}\) where n is the number of points and d is the number of spatial dimensions. This rule is very fast to compute. It typically produces a larger bandwidth than Diggle’s method. It is useful for estimating gradual trend.

  • bw.ppl() Likelihood Cross Validation Bandwidth Selection for Kernel Density: This approach, explained by Loader (1999), uses likelihood cross-validation to determine the bandwidth (σ) by maximizing the point process likelihood. This method is beneficial when the aim is to maximize the likelihood of observing the given data.

bw_diggle <- bw.diggle(destination_ppp_sg.km)
bw_diggle
      sigma 
0.008317447 
bw_CvL <- bw.CvL(destination_ppp_sg.km)
bw_CvL
   sigma 
3.745658 
bw_scott <- bw.scott(destination_ppp_sg.km)
bw_scott
  sigma.x   sigma.y 
1.4763217 0.9063352 
bw_ppl <- bw.ppl(destination_ppp_sg.km)
bw_ppl
    sigma 
0.1913655 
Reflection

We notice that bw_diggle, bw_CvL and bw_ppl all give a numeric sigma value, whereas bw_scott, by default, provides a separate bandwidth for each coordinate axis. In the code output above, sigma.x = 1.4763217 and sigma.y = 0.9063352 are the estimated bandwidths for the x and y coordinates, respectively. These values represent the amount of smoothing applied in each direction when estimating the kernel density.

We can specify isotropic=TRUE argument when calculating bw_scott() method to produce a single value bandwidth.

bw_scott_single <- bw.scott(destination_ppp_sg.km, isotropic=TRUE)
bw_scott_single 
   sigma 
1.156738 

The optimized bandwidth values generated from above methods belongs to the special class bw.optim. The plot function can be used to see the objective function for the optimisation that leads to the result.

par(mfrow = c(1,2))
plot(bw_diggle, xlim=c(-0.02,0.05), ylim=c(-60,200))
plot(bw_CvL)

par(mfrow = c(1,2))
plot(bw_scott, main="bw_scott")
plot(bw_ppl,  xlim=c(-1,5), ylim=c(70000,130000))

7.3.2 Plotting Fixed-Bandwidth KDE Layers

In practice, there are no definite method to choose the KDE bandwidth. Many literature has outlined a diverse range of approaches for KDE bandwidth selection. According to Wolff and Asche (2009), the choice of bandwidth in many existing studies is mostly conducted by visually comparing different bandwidth setting.

Hence, we will now create KDE layers based on each bandwidth selection method and visualize them to have a better comparison of how distinct the resulting KDE layers are.

kde_diggle <- density(destination_ppp_sg.km, bw_diggle)
kde_CvL <- density(destination_ppp_sg.km, bw_CvL)
kde_scott <- density(destination_ppp_sg.km, bw_scott)
kde_ppl <- density(destination_ppp_sg.km, bw_ppl)

par(mar = c(1,1,1,1.5),mfrow = c(2,2))
plot(kde_diggle,main = "kde_diggle")
plot(kde_CvL,main = "kde_CvL")
plot(kde_scott,main = "kde_scott")
plot(kde_ppl,main = "kde_ppl")

Next, we will try to plot histograms to compare the distribution of KDE values obtained from density() function using different bandwidth selection methods.

par(mar = c(2,2,2,2),mfrow = c(2,2))
hist(kde_diggle,main = "kde_diggle")
hist(kde_CvL,main = "kde_CvL")
hist(kde_scott,main = "kde_scott")
hist(kde_ppl,main = "kde_ppl")

Reflection

We can interpret the outputs as below:

  1. kde_diggle: The sharp peak at the beginning indicates that the Diggle method for bandwidth selection has indentified a high concentration of points in the first bin. The rest of the bins has little to no concentration. This may suggest that one specific area in our observation window has observed a relatively high spatial clustering than the rest of the window.

  2. kde_CvL: The more balanced distribution suggests that the CvL method for bandwidth selection is identifying a broader range of spatial point concentration. However, the bin sizes are quite small, which smooths out the overall distribution and masks some of the finer details.

  3. kde_scott: The wider range of values and less sharp peak compared to kde_diggle indicate that the Scott method is capturing a wider range of spatial point concentrations, including both densely concentrated locations and moderately concentrated ones.

  4. kde_ppl: The result is very similar to the Diggle method, the sharp peak at the beginning suggests a high concentration of points in a specific area, suggest that one specific area in our observation window has observed a relatively high spatial clustering than the rest of the area.

Another apporach to compare the KDE layers is to calculate the standard error of each density estimation. $SE is used to extract the standard error of the density estimate from the output of the density() function

dse_diggle <- density(destination_ppp_sg.km, bw_diggle, se=TRUE)$SE
dse_CvL <- density(destination_ppp_sg.km, bw_CvL, se=TRUE)$SE
dse_scott <- density(destination_ppp_sg.km, bw_scott, se=TRUE)$SE
dse_ppl <- density(destination_ppp_sg.km, bw_ppl, se=TRUE)$SE
par(mar = c(1,1,1,1.5),mfrow = c(2,2))
plot(dse_diggle,main = "standard error_diggle")
plot(dse_CvL,main = "standard error_CvL")
plot(dse_scott,main = "standard error_scott")
plot(dse_ppl,main = "standard error_ppl")

Reflection

The standard error (SE) of the density estimate provides a measure of the uncertainty associated with the density estimate at each point. This can be useful for understanding the variability of density estimates, especially when comparing density estimates obtained using different bandwidths.

However, in many applications of KDE, the focus is often on the shape of the density estimate rather than its absolute value. In such cases, the standard error might not be as relevant. Hence, in this analysis, we will not use standard error as a criteria for choosing the bandwidth.

7.3.3 Choosing Fixed KDE Bandwidth Selection Method

Upon the exploration of various fixed bandwidth selection methods for computing KDE vales, and subsequent plotting of the respective KDE estimates, their distributions and associated standard errors, we will now select the KDE bandwidth to be used in our analysis. As we have seen in Section 7.3.1.2, each KDE bandwidth method has produced a distinct KDE and there is no definite method to choose the KDE bandwidth.

We will proceed to choose bw_scott method for further analysis. This is because:

  1. bw_scott method provides a pair of bandwidth values for each coordinate axis. This allows it to capture the different levels of spatial clustering in each direction more accurately.
  2. bw_scott method capture the balance between bias and variance the best among all methods. If the bandwidth is too small, the estimate may be too skewed (high variance). The distribution histograms of KDE layers using bw_diggle and bw_ppl tend to indicate such nature. On the other hand, if the bandwidth is too large, the estimate may be oversmoothed, missing crucial elements of the point pattern (high bias). This is what we observed in the distribution histogram of KDE layer using bw_CvL.

Since we have chosen to use bw_scott method, now we will plot the KDE layer using this method for further analysis.

par(mar = c(0,1,1,1))
bw_fixed_scott <- bw.scott(destination_ppp_sg.km)
bw_fixed_scott
  sigma.x   sigma.y 
1.4763217 0.9063352 
kde_fixed_scott <- density(destination_ppp_sg.km, bw_fixed_scott)
plot(kde_fixed_scott,main = "Fixed-Bandwidth KDE for Grab Destination Points (Using bw_scott)")
contour(kde_fixed_scott, add=TRUE)

However, upon visual inspection, there are signs of a certian degree of over-smoothing when we directly use the bandwidth provide by bw_scott method. Automatic bandwidth selection methods provides a starting point for bandwidth selection, and further fine-tuning might be necessary based on the results of the plot we have created above.

To do so, we will use rule of thumb adjustment by diving the bandwidth value by 2, to reduce the bandwidth size, and hence possible over-smoothing effect.

par(mar = c(0,1,1,1))
kde_fixed_scott <- density(destination_ppp_sg.km, sigma=bw_fixed_scott/2)
plot(kde_fixed_scott,main = "Fixed-Bandwidth KDE for Grab Destination Points (Using bw_scott)")
contour(kde_fixed_scott, add=TRUE)

Looking at the plot created, it appears that by reducing the bandwidth (thus making the point cluster buffers smaller), the over-smoothing effects have been minimized. However, we still can observe the Grab destination hotspot areas, which we can investigate further in subsequent sections.

7.3.4 Kernel Function Selection for Fixed Bandwidth

Kernel functions are another consideration in KDE computation because they control how we weight points within the bandwidth radius. The default kernel in density.ppp() is the gaussian. alternatives such as epanechnikov, quartic, and disc are also available.

In this section, we will explore and experiment with different kernel functions and plot them together to see how it influences our KDE map results.

kde_fixed_scott.gaussian <- density(destination_ppp_sg.km, 
                          sigma=bw_fixed_scott, 
                          edge=TRUE, 
                          kernel="gaussian")


kde_fixed_scott.epanechnikov <- density(destination_ppp_sg.km, 
                          sigma=bw_fixed_scott, 
                          edge=TRUE, 
                          kernel="epanechnikov")
   
kde_fixed_scott.quartic <- density(destination_ppp_sg.km, 
                          sigma=bw_fixed_scott, 
                          edge=TRUE, 
                          kernel="quartic")
       
   
kde_fixed_scott.disc <- density(destination_ppp_sg.km, 
                          sigma=bw_fixed_scott, 
                          edge=TRUE, 
                          kernel="disc")
         
par(mar = c(1,1,1,1.5),mfrow = c(2,2))
plot(kde_fixed_scott.gaussian, main="Gaussian")
plot(kde_fixed_scott.epanechnikov, main="Epanechnikov")
plot(kde_fixed_scott.quartic, main="Quartic")
plot(kde_fixed_scott.disc, main="Disc")

Xie & Yan (2008) suggested that the choice of kernel function has minimal impact on density estimation outcomes. Shen et al. (2020) further identified the search bandwidth as a more influential factor in kernel estimation than the selection of different kernel functions. Empirically looking at the KDE maps we have created above using different kernel functions, similar conclusion can be made. Despite the slight variations in smoothness and spread, all four plots show similar patterns of density estimation. This underscores the argument that the choice of kernel function does not significantly impact KDE results. Consequently, we will not focus this aspect in our analysis.

7.4 Creating KDE Layers with Spatially Adaptive Bandwidth

Fixed bandwidth kernels are commonly employed in statistical literature due to their ease of implementation. However, their application in spatial datasets often yields suboptimal estimations due to a lack of spatial and temporal adaptability (González & Moraga, 2022). Consequently, the more intuitive ‘adaptive smoothing’ approach has emerged. In this technique, the amount of smoothing is inversely related to the density of the points.

In spatstat packages, there are three main approaches (Pebesma & Bivand, 2023) in creating KDE layers with spatially adapative bandwidth.

  • Voronoi-Dirichlet Adaptive Density Estimate: It computes the intensity function estimate of a point pattern dataset by creating tessellations. On default, the input point pattern data is used to construct a Voronoi/Dirichlet tessellation (Barr and Schoenberg, 2010). The intensity estimate at a given location equals the reciprocal of the size of the Voronoi/Dirichlet cell containing that location.

  • Adaptive Kernel Density Estimate: It computes an estimate of the intensity function of a point pattern dataset using the partitioning technique of Davies and Baddeley (2018). It dynamically specifies the smoothing bandwidth to be applied to each of the points. The partitioning method of Davies and Baddeley (2018) accelerates this computation by partitioning the range of bandwidths into n-groups intervals, correspondingly subdividing the point patterns into n-groups, sub-patterns according to bandwidth, and applying fixed-bandwidth smoothing to each sub-pattern.

  • Nearest-Neighbour Adaptive Density Estimate: It computes an estimate of the intensity function of a point pattern dataset using the distance from each spatial location to the kth nearest points (Cressie, 1991: Silverman, 1986;Burman & Nolan, 1989). The default value of k is the square root of the number of points in the dataset. This estimator of intensity is relatively fast to compute and is spatially adaptive. Some studies suggest the use of the nearest neighbor distance as a suitable parameter for determining the bandwidth (Krisp et al., 2009).

7.4.1 Voronoi-Dirichlet Adaptive Density Estimate

The Dirichlet-Voronoï estimator is computed in spatstat by the function adaptive.density() with argument method="voronoi".

kde_destination_dirichlet_adaptive <- adaptive.density(destination_ppp_sg.km, f=1, method = "voronoi")
par(mar = c(0,1,1,1))
plot(kde_destination_dirichlet_adaptive,main = "Voronoi-Dirichlet Adaptive Density Estimate")

7.4.2 Adaptive Kernel Density Estimate

The Adaptive Kernel estimator is computed in spatstat by the function adaptive.density() with argument method="kernel".

kde_destination_kernel_adaptive <- adaptive.density(destination_ppp_sg.km, method = "kernel")
par(mar = c(0,1,1,1))
plot(kde_destination_kernel_adaptive,main = "Adaptive Kernel Density Estimate")

7.4.3 Nearest-Neighbour Density Estimate

The Nearest-Neighbour estimator is computed in spatstat by the function nndensity().

kde_adaptive_nn <- nndensity(destination_ppp_sg.km, k=10)
par(mar = c(0,1,1,1))
plot(kde_adaptive_nn,main = "Nearest-Neighbour Adaptive Density Estimate")

7.4.4 Choosing Adaptive KDE Method

Similar to what we did for fixed bandwidth, we can try to plot histograms to compare the distribution of KDE values obtained from density() function using different adaptive bandwidth selection methods.

par(mar = c(2,2,2,2),mfrow = c(2,2))
hist(kde_destination_dirichlet_adaptive,main = "Voronoi-Dirichlet Adaptive")
hist(kde_destination_kernel_adaptive,main = "Adaptive Kernel")
hist(kde_adaptive_nn,main = "Nearest-Neighbour Adaptive")

From the outputs, it seems that there is no significance difference between the distribution of KDE values obtained across different methods. All three methods identified a high concentration of points in a specific area. Hence, we will choose to go with Adapative Kernel method because it provides the most

7.5 Plotting Interactive KDE Maps

raster_kde_fixed_scott <- raster(kde_fixed_scott)
raster_kde_adaptive_nn <- raster(kde_adaptive_nn)
raster_kde_adaptive_kernel <- raster(kde_destination_kernel_adaptive)
projection(raster_kde_fixed_scott) <- CRS("+init=EPSG:3414 +units=km")
projection(raster_kde_adaptive_nn) <- CRS("+init=EPSG:3414 +units=km")
projection(raster_kde_adaptive_kernel) <- CRS("+init=EPSG:3414 +units=km")
tmap_mode('view')
kde_fixed_scott <- tm_basemap(server = "OpenStreetMap.HOT") +
  tm_basemap(server = "Esri.WorldImagery") +
  tm_shape(raster_kde_fixed_scott) +
  tm_raster("layer",
            n = 10,
            title = "KDE_Fixed_scott",
            alpha = 0.6,
            palette = c("#fafac3","#fd953b","#f02a75","#b62385","#021c9e")) +
  tm_shape(mpsz_sf)+
  tm_polygons(alpha=0.1,id="PLN_AREA_N")+
  tmap_options(check.and.fix = TRUE)

tmap_mode('view')
kde_adaptive_nn <- tm_basemap(server = "OpenStreetMap.HOT") +
  tm_basemap(server = "Esri.WorldImagery") +
  tm_shape(raster_kde_adaptive_nn) +
  tm_raster("layer",
            n = 7,
            title = "KDE_Adaptive_nn",
            style = "pretty",
            alpha = 0.6,
            palette = c("#fafac3","#fd953b","#f02a75","#b62385","#021c9e")) +
  tm_shape(mpsz_sf)+
  tm_polygons(alpha=0.1,id="PLN_AREA_N")+
  tmap_options(check.and.fix = TRUE)

tmap_mode('view')
kde_adaptive_kernel <- tm_basemap(server = "OpenStreetMap.HOT") +
  tm_basemap(server = "Esri.WorldImagery") +
  tm_shape(raster_kde_adaptive_kernel) +
  tm_raster("layer",
            n = 7,
            title = "KDE_Adaptive_Kernel",
            style = "pretty",
            alpha = 0.6,
            palette = c("#fafac3","#fd953b","#f02a75","#b62385","#021c9e")) +
  tm_shape(mpsz_sf)+
  tm_polygons(alpha=0.1,id="PLN_AREA_N")+
  tmap_options(check.and.fix = TRUE)

tmap_arrange(kde_fixed_scott, kde_adaptive_nn, kde_adaptive_kernel ,ncol=1,nrow=3,sync = TRUE)

7.6 Analysis of Singapore-Level Kernel Density Estimation Maps

FromThe highest cluster of Grab taxi drop-off points are seen near Changi Airport, where the concentrations reach up to 450.

The central and southern parts of Singapore, particularly the Central Business District (CBD) and Marina Bay areas, also exhibit substantial concentrations of Grab taxi drop-off points, peaking at 350. These areas have a high demand for taxi services, possibly due to a high concentration of businesses, tourist attractions.

Interestingly enough, we found a few residential subzones outside of the CBD region, which is comparatively higher density values compared to the rest of the island. However, because of the smoothing effect, identifying these hotspots using a fixed-bandwidth KDE map is difficult. Hence, we proceeded to perform a cross inspection with adaptive KDE maps for better precision in hotspots identification.

Through adaptive nearest neighbour KDE map, we identified distinct hotspots in Jurong West (7000-8000), Woodlands (6000-7000), Tampines (4000-5000), and Toa Payoh (4000-5000). The Adaptive Kernel KDE map complements this finding by accentuating two additional hotspots, Jurong East (500-1000) and Punggol (500-1000).

Source: URA Master Plan 2019

It is intriguing to see that the six residential subzones we have identified have higher concentrations of drop-off points, despite being predominantly classified as residential areas , according to the Master Plan 2019 (MP19) by Urban Redevelopment Authority Singapore. It will be interesting to discern underlying spatial process and pull factors that lead to this spatial cluster. Therefore, we will attempt to create comparable KDE maps at the planning subzone level.

7.7 Planning Area-Level Kernel Density Estimation

In this section, we will create planning-area level KDE maps for six planning areas we have identified. In order to create such maps, we will carry out additional data wrangling as required.

Firstly, we will filter out different planning areas as separate sf objects from mpsz_sf.

wl = mpsz_sf %>% filter(PLN_AREA_N == "WOODLANDS")
je = mpsz_sf %>% filter(PLN_AREA_N == "JURONG EAST")
jw = mpsz_sf %>% filter(PLN_AREA_N == "JURONG WEST")
tn = mpsz_sf %>% filter(PLN_AREA_N == "TAMPINES")
tp = mpsz_sf %>% filter(PLN_AREA_N == "TOA PAYOH")
pg = mpsz_sf %>% filter(PLN_AREA_N == "PUNGGOL")


par(mar = c(1,1,1,0),mfrow=c(2,3))
plot(st_geometry(wl), main = "Woodlands")
plot(st_geometry(je), main = "Jurong East")
plot(st_geometry(jw), main = "Jurong West")
plot(st_geometry(tn), main = "Tampines")
plot(st_geometry(tp), main = "Toa Payoh")
plot(st_geometry(pg), main = "Punggol")

Next, we will create owin objects to represent the observation windows for respective planning area. Once owin objects are created, we will also filter Grab taxi drop-off points in each observation window from the original destination_ppp_sg ppp object.

wl_owin = as.owin(wl)
je_owin = as.owin(je)
jw_owin = as.owin(jw)
tn_owin = as.owin(tn)
tp_owin = as.owin(tp)
pg_owin = as.owin(pg)

destination_wl_ppp = destination_ppp_sg[wl_owin]
destination_je_ppp = destination_ppp_sg[je_owin]
destination_jw_ppp = destination_ppp_sg[jw_owin]
destination_tn_ppp = destination_ppp_sg[tn_owin]
destination_tp_ppp = destination_ppp_sg[tp_owin]
destination_pg_ppp = destination_ppp_sg[pg_owin]

Now that we have prepared both owin and ppp objects for each planning area, we are ready to plot KDE maps. Similar to what we have done in previous section, we will try both fixed-bandwitdh and adaptive bandwidth KDE maps.

7.7.1 Planning Area-Level Fixed-Bandwidth KDE Maps

wl_kde_scott <- density(destination_wl_ppp, sigma=bw.scott, main="Woodlands")
je_kde_scott <- density(destination_je_ppp, sigma=bw.scott, main="Jurong East")
jw_kde_scott <- density(destination_jw_ppp, sigma=bw.scott, main="Jurong West")
tn_kde_scott <- density(destination_tn_ppp, sigma=bw.scott, main="Tampines")
tp_kde_scott <- density(destination_tp_ppp, sigma=bw.scott, main="Toa Payoh")
pg_kde_scott <- density(destination_pg_ppp, sigma=bw.scott, main="Punggol")
par(mar = c(1,1,1,1.5),mfrow = c(3,2))

plot(wl_kde_scott,main = "Fixed KDE Woodlands")
contour(wl_kde_scott, add=TRUE)
plot(je_kde_scott,main = "Fixed KDE Jurong East")
contour(je_kde_scott, add=TRUE)
plot(jw_kde_scott,main = "Fixed KDE Jurong West")
contour(jw_kde_scott, add=TRUE)
plot(tn_kde_scott,main = "Fixed KDE Tampines")
contour(tn_kde_scott, add=TRUE)
plot(tp_kde_scott,main = "Fixed KDE Toa Payoh")
contour(tp_kde_scott, add=TRUE)
plot(pg_kde_scott,main = "Fixed KDE Punggol")
contour(pg_kde_scott, add=TRUE)

7.7.2 Planning Area-Level Adaptive-Bandwidth KDE Maps

wl_kde_adaptive_kernel <- adaptive.density(destination_wl_ppp, method = "kernel")
je_kde_adaptive_kernel <- adaptive.density(destination_je_ppp, method = "kernel")
jw_kde_adaptive_kernel <- adaptive.density(destination_jw_ppp, method = "kernel")
tn_kde_adaptive_kernel <- adaptive.density(destination_tn_ppp, method = "kernel")
tp_kde_adaptive_kernel <- adaptive.density(destination_tp_ppp, method = "kernel")
pg_kde_adaptive_kernel <- adaptive.density(destination_pg_ppp, method = "kernel")

par(mar = c(1,1,1,1.5),mfrow = c(3,2))

plot(je_kde_adaptive_kernel,main = "Adaptive KDE Woodlands")
contour(je_kde_adaptive_kernel, add=TRUE)
plot(je_kde_adaptive_kernel,main = "Adaptive KDE Jurong East")
contour(je_kde_adaptive_kernel, add=TRUE)
plot(jw_kde_adaptive_kernel,main = "Adaptive KDE Jurong West")
contour(jw_kde_adaptive_kernel, add=TRUE)
plot(tn_kde_adaptive_kernel,main = "Adaptive KDE Tampines")
contour(tn_kde_adaptive_kernel, add=TRUE)
plot(tp_kde_adaptive_kernel,main = "Adaptive KDE Toa Payoh")
contour(tp_kde_adaptive_kernel, add=TRUE)
plot(pg_kde_adaptive_kernel,main = "Adaptive KDE Punggol")
contour(pg_kde_adaptive_kernel, add=TRUE)

7.8 Analysis of Planning Area-Level Kernel Density Estimation Maps

Observations from the planning area-level KDE maps reveal that while KDE maps excel in visualizing and analyzing spatial data, their application to micro-level analysis, like planning subzones, has limitations. Firstly, there are issues with over-smoothing (in fixed KDE maps) and undersmoothing (in adaptative KDE maps) which deters us from discerning meaningful insights. Furthermore, KDE values, generated on grid-pixels using Euclidean distance, result in grid blocks that limit our ability to discern fine details and variations within each subzone.

For example, let’s look at the picture below, which is a snapshot from KDE adaptive nearest-neighbor map. A hotspot is identified and represented by a concentrated red zone near Braddell Road in Toa Payoh. However, just by looking at this map, it’s really hard to understand more about this hotspot. We know there is a cluster of drop-off points, but why are they there? What are pull factors attracting people to this area? It’s almost impossible to answer these questions just by looking at this KDE map.

This limitation underscores the need for more advanced visualization techniques or analytical methods that can provide alternative approaches to understanding spatial point clusters. With this limitation in mind, we will apply a new analytical method called network constrained kernel density estimation (NKDE) to offer more detailed insights into the spatial distribution of Grab taxi drop-off points through the integration of road networks.

8.0 Network Constrained Kernel Density Estimation (NKDE)

In the real world, point events are often not randomly distributed. Instead, their distribution is constrained by networks. When carrying out spatial point pattern analysis, traditional Kernel Density Estimation (KDE) assumes an infinite, homogeneous, two-dimensional space, an approximation that is not accurate for network-based study areas. In such networks, movement is constrained by multiple one-dimensional lines (Gelb, 2021). Network Constrained Kernel Density Estimation (NKDE), therefore, is a widely used approach to identify the hotspots and evaluate origin-destination points along with the road network (Shen et al. 2020).

This approach estimates the intensity of the spatial process solely on the network. The network edges are divided into lixels (one-dimensional pixels), and the centers of the lixels serve as the locations for intensity estimation. Distances between events and sampling points are calculated as the shortest path distances on the network, instead of Euclidean distances. This adjustment slightly modifies the intensity function from the classical KDE function. The adapted formula makes the interpretation straightforward: it “estimates the density over a linear unit” rather than an area unit.

In this section, we will follow-up on the six planning areas we identified in Section 7.6 and attempt to create NKDE maps using spNetwork package.

8.1 Extracting Road Networks for Focus Planning Areas

Before we carry out the anlysis, we will extract relevant datasets required for calculating NKDE values in each planning area:

  • road network

  • destination points

wl_network = st_intersection(sg_driving_sf,st_union(wl))
je_network = st_intersection(sg_driving_sf,st_union(je))
jw_network = st_intersection(sg_driving_sf,st_union(jw))
tn_network = st_intersection(sg_driving_sf,st_union(tn))
tp_network = st_intersection(sg_driving_sf,st_union(tp))
pg_network = st_intersection(sg_driving_sf,st_union(pg))
wl_destination = st_intersection(destination_sf,st_union(wl))
je_destination = st_intersection(destination_sf,st_union(je))
jw_destination = st_intersection(destination_sf,st_union(jw))
tn_destination = st_intersection(destination_sf,st_union(tn))
tp_destination = st_intersection(destination_sf,st_union(tp))
pg_destination = st_intersection(destination_sf,st_union(pg))

8.3 Data Preparation

For illustrative purpose, I’ll use “Punggol” as a example to demonstrate step-by-step data preparation process. Upon completion of this demonstration, maps for the remaining planning areas will be created.

The spNetwork package contains nkde function that is specifically designed to implement Network Constrained Kernel Density Estimation (NetKDE).

The three main inputs are:

  • lixels: a “SpatialLinesDataFrame”, representing the lines of the network.

  • events: a “SpatialPointsDataframe”, representing the realizations of the spatial process.

  • samples: a “SpatialPointsDataframe” providing the locations where the density must be estimated.

8.3.1 Preparing the lixels objects

Before computing NetKDE, the SpatialLines object need to be cut into lixels with a specified minimal distance. This task can be performed by using with lixelize_lines() of spNetwork.

pg_lixels <- lixelize_lines(pg_network, 
                         700, 
                         mindist = 350)

In this code snippet, the length of a lixel is set to 700m, and the minimum length of a lixel, denoted as mindist, is set to 350m. This implies that during segmentation, if the length of the final lixel is shorter than the minimum distance, then it is added to the previous lixel. The segments that are already shorter than the minimum distance are not modified.

8.3.2 Generating line centre points

Next, lines_center() of spNetwork will be used to generate a SpatialPointsDataFrame (i.e. samples) with line centre points. The points are located at center of the line based on the length of the line.

pg_samples <- lines_center(pg_lixels)

8.4 Performing NKDE

8.4.1 Computing Network Constrained Kernel Density Estimation

The spNetwork package offers three methods for calculating NKDE values simple, discontinuous and continuous.

  • Simple NKDE: Simple NKDE method was proposed by Xie and Yan (2008), extending the planar KDE to a network-based model. However, there are issues with this method due to statistical incorrectness. Specifically, at network intersections, the event’s mass is multiplied in each direction, leading to overestimation and inflated density estimates, which can result in misleading interpretations (Gelb, 2021). To address these issues, Okabe et al. (2009) proposed two unbiased estimators: Discontinuous NKDE and Continuous NKDE.

  • Discontinuous NKDE: Discontinuous NKDE aims to resolve density inflation by dividing the mass at intersections according to the number of directions minus one, resulting in an unbiased estimator (Gelb, 2021). However, the discontinuous nature of this method can be counter-intuitive in practical applications.

  • Continuous NKDE: Continuous NKDE proposes to address the limitations of both simple and discontinuous NKDE. It adjusts the values of the NKDE at intersections and applies a backward correction to force the density values to be continuous (Gelb, 2021).

In this analysis, we will compute NKDE values using all three methods and compare the results.

pg_nkde_simple <- nkde(pg_network, 
                  events = pg_destination,
                  w = rep(1,nrow(pg_destination)),
                  samples = pg_samples,
                  kernel_name = "quartic",
                  bw = 200, 
                  div= "bw", 
                  method = "simple", 
                  digits = 1, 
                  tol = 1,
                  grid_shape = c(1,1), 
                  max_depth = 8,
                  agg = 5,
                  sparse = TRUE,
                  verbose = FALSE)
pg_nkde_discontinuous <- nkde(pg_network, 
                  events = pg_destination,
                  w = rep(1,nrow(pg_destination)),
                  samples = pg_samples,
                  kernel_name = "quartic",
                  bw = 200, 
                  div= "bw", 
                  method = "discontinuous", 
                  digits = 1, 
                  tol = 1,
                  grid_shape = c(1,1), 
                  max_depth = 8,
                  agg = 5,
                  sparse = TRUE,
                  verbose = FALSE)
pg_nkde_continuous <- nkde(pg_network, 
                  events = pg_destination,
                  w = rep(1,nrow(pg_destination)),
                  samples = pg_samples,
                  kernel_name = "quartic",
                  bw = 200, 
                  div= "bw", 
                  method = "continuous", 
                  digits = 1, 
                  tol = 1,
                  grid_shape = c(1,1), 
                  max_depth = 8,
                  agg = 5,
                  sparse = TRUE,
                  verbose = FALSE)
Reflection

bw argument refers to the bandwidth used for calculating KDE values. Xie and Yan (2008) suggested that narrow bandwidths (between 20 and 250 m) are more appropriate for identifying local effects at smaller scales. Hence, we use 200 here.

agg argument allows the events to be aggregated, and their weights to be added within a threshold distance. Aggregating events can simplify networks and limit the number of iterations when calculating the NKDE, hence effectively reducing time complexity of computation.

8.4.2 Visualising NKDE Maps Using Different Methods

Before we can visualise the NetKDE values, we will insert the computed density values into samples and lixels objects. To enhance the readability of the results, you will first multiply the obtained densities by the total number of drop-off points. This adjustment ensures that the spatial integral equals the number of events. Subsequently, you will multiply this value by 1000 to derive the estimated number of drop-off points per kilometer. This process will provide a more intuitive understanding of the density distribution along the network.

pg_samples$nkde_simple <- pg_nkde_simple*nrow(pg_destination)*1000
pg_lixels$nkde_simple <- pg_nkde_simple*nrow(pg_destination)*1000

pg_samples$nkde_discontinuous <- pg_nkde_discontinuous*nrow(pg_destination)*1000
pg_lixels$nkde_discontinuous <- pg_nkde_discontinuous*nrow(pg_destination)*1000

pg_samples$nkde_continuous <- pg_nkde_continuous*nrow(pg_destination)*1000
pg_lixels$nkde_continuous <- pg_nkde_continuous*nrow(pg_destination)*1000

Now we can plot NKDE maps using tmap.

tmap_mode('view')
pg_nkde_simple_map <- tm_basemap(server = "Esri.WorldTopoMap") +
  tm_basemap(server = "Esri.WorldGrayCanvas")+
  tm_basemap(server = "OpenStreetMap") +
tm_shape(pg_lixels)+
  tm_lines(col="nkde_simple", lwd = 2, palette =c("#fafac3","#fd953b","#f02a75","#b62385","#021c9e"), id="nkde_continuous")+
tm_shape(pg_destination)+
  tm_dots(size=0.01)

pg_nkde_discontinuous_map <- tm_basemap(server = "Esri.WorldTopoMap") +
  tm_basemap(server = "Esri.WorldGrayCanvas")+
  tm_basemap(server = "OpenStreetMap") +
tm_shape(pg_lixels)+
  tm_lines(col="nkde_discontinuous", lwd = 2, palette =c("#fafac3","#fd953b","#f02a75","#b62385","#021c9e"), id="nkde_continuous")+
tm_shape(pg_destination)+
  tm_dots(size=0.01)

pg_nkde_continuous_map <- tm_basemap(server = "Esri.WorldTopoMap") +
  tm_basemap(server = "Esri.WorldGrayCanvas")+
  tm_basemap(server = "OpenStreetMap") +
tm_shape(pg_lixels)+
  tm_lines(col="nkde_continuous", lwd = 2, palette =c("#fafac3","#fd953b","#f02a75","#b62385","#021c9e"), id="nkde_continuous")+
tm_shape(pg_destination)+
  tm_dots(size=0.01)


tmap_arrange(pg_nkde_simple_map, pg_nkde_discontinuous_map, pg_nkde_continuous_map ,ncol=3,nrow=1,sync = TRUE)
Analysis

Comparing Simple, Discontinuous and Continuous NKDE Maps

Upon close examination to the intersection as shown above, we can clearly see the density inflation caused by simple NKDE method, as opposed to discontinuous and continuous methods. A comparison between the Discontinuous and Continuous NKDE maps reveals the disjointed nature of lixels resulting from the Discontinuous method, as indicated by the varying color bands representing each lixel segment. Out of all three maps, continuous NKDE map offers the most intuitive representation.

8.5 Plotting NKDE for Different Planning Areas

In this section, we will create interactive maps of different planning areas that we have previousely identified as potential hotspots. These maps will incorporate several key elements: network kernel density estimation values (depicted through lixels), Grab drop-off points, and points of interest (POIs). The purpose of including these elements is to identify the pull factors within these areas. Understanding these factors can help in planning and decision-making processes, such as where to locate new facilities or how to improve transportation routes.

8.6 Analysis of Network Constrained Kernal Density Estimation (NKDE) Maps

Overall, It is intriguing to observe the interplay of similar, yet distinct, pull factors across different planning subzones.

  • Punggol: In Punggol, all observed clusters are located near schools and parks. The most significant clusters are found in the vicinity of Edgefield Secondary School, Edgefield Primary School, and Punggol Green Primary School. Multiple smaller clusters are observed around neighbourhood parks.

  • Tampines: In Tampines, a significant cluster is observed at Tanah Merah Country Club and Laguna Country Club, suggesting a high demand for taxi services in these recreational areas. A smaller cluster is also noticeable near Simei Neighbourhood Park.

  • Jurong East: In Jurong East, a significant cluster is observed leading to IMM, a major outlet shopping center, which seems to attract multiple taxi trajectories. Additional clusters are observed near Yuhua Primary School and Crest Secondary School, as well as various neighbourhood parks.

  • Jurong West: In Jurong West, two significant clusters are observed, one near Westwood Secondary School and the other leading to Nanyang Technological University, highlighting the importance of educational institutions in this area. Another equally significant cluster is located near Masjid Maarof, one of Singapore’s oldest and most prominent mosques.

  • Woodlands: In Woodlands, a significant cluster is observed at Innova Junior College and Singapore Sports School. Additional clusters are found at Woodland Civic Centre, a community centre with library and dining areas, and STELLAR@TE2, a lifestyle shopping mall. We also observed another cluster leads to Singapore Turf Club.

  • Toa Payoh: In Toa Payoh, a cluster is observed near Toa Payoh Townpark, indicating its popularity as a communal destination. Another cluster is found at St Andrew’s Junior College.

In conclusion, the results of NKDE analysis of Grab taxi drop-off points in six selected planning areas reveals a nuanced pattern of clusters associated with various types of amenities and institutions. These findings suggest that different planning areas have distinct characteristics that attract people to specific locations within them. Additionally, understanding these pull factors can help inform urban planning and development strategies to create anew or further enhance the attractiveness of these areas.

9.0 Temporal Network Kernel Density Estimation (TNKDE)

Temporal Network Kernel Density Estimation (TNKDE) is an extension of Network Constrained Kernel Density Estimation (NKDE) whereby the the temporal dimension is integrated to calculating density of events on a network (Gelb & Apparicio, 2023). This means that density estimation can be done along lines of the network and at different times. The spatio-temporal kernel is calculated as the product of the network kernel density and the time kernel density.

TNKDE can be calculated in calculated in R environment using tnkde() function from spNetwork package.

9.1 Visualising Frequency Distribution

Before we calculate TNKDE, we will conduct an exploratory analysis of our data from the Grab Posisi dataset. This will involve visualizing the frequency distribution of both our origin and destination point samples. This step will provide us with a preliminary understanding of our data distribution, which is crucial for effective TNKDE calculation.

Firstly, we will analyze the frequency distribution of days from our trajectory data samples. To achieve this, we will plot bar graphs representing the count of trajectories originating and ending on each day of the week

origin_day <- ggplot(data=origin_sf, 
              aes(x=weekday)) + 
              geom_bar()

destination_day <-  ggplot(data=destination_sf, 
                    aes(x=weekday)) + 
                    geom_bar()

origin_day + destination_day

As seen above, each bar graph displays a uniform distribution across all weekdays. The dataset is sampled; hence all days are equally represented in both origin and destination datasets.

9.2 Plotting Frequency of Trip Origination By Hour

Next, we will analyze the frequency distribution of trip origination by hour from our trajectory data samples. To achieve this, we will plot bar graphs representing the count of trajectories originating points in each hour.

origin_sf$pickup_hr_num <- as.numeric(origin_sf$pickup_hr)
pickup_hr_num <- origin_sf$pickup_hr_num

ggplot(data=origin_sf, 
                    aes(x=pickup_hr_num), bins = 24) + 
                    geom_bar() +
    scale_x_continuous(breaks = pickup_hr_num)

We observed noticeable peaks at 3, 5, 14, and 15 hours (3 AM, 5 AM, 2 PM and 3 PM respectively). There’s a significant drop in trip originations during 9 hour (9 AM), as well as from 19 to 22 hours (7 PM to 10 PM). These could be quieter periods in the day when fewer trips are originated. It is also interesting to see a sudden increase of trip origination at 23 hour (11 PM).

9.3 Plotting Frequency of Trip Destination By Hour

Next, we will analyze the frequency distribution of trip destination by hour from our trajectory data samples. To achieve this, we will plot bar graphs representing the count of trajectories destination points in each hour.

destination_sf$dropoff_hr_num <- as.numeric(destination_sf$dropoff_hr)
dropoff_hr_num <- destination_sf$dropoff_hr_num

ggplot(data=destination_sf, 
                    aes(x=dropoff_hr_num), bins = 24) + 
                    geom_bar() +
    scale_x_continuous(breaks = dropoff_hr_num)

Somewhat similar to origination points, we observed noticeable peaks at 3, 13, and 14 hours (3 AM, 1 PM, and 2 PM respectively). There’s a significant drop in trip desintation during 9, 18, 21 and 23 hours (9 AM, 6 PM, 9 PM and 11 PM respectively).

Reflection

While the frequency plots of trip origination and destination points based on temporal segments reveal intriguing patterns, deriving meaningful insights from these can be challenging. This is primarily due to the nature of the dataset. As a sample dataset, the data points are randomly collected within a specific time window. Consequently, these data may not accurately represent the true temporal variations in traffic flows. Therefore, while the plots provide a snapshot of the data, they may not fully capture the complexity and variability of real-world traffic patterns.

9.4 Time-Bandwidth Selection

We can now calculate the kernel density values of destination points in time for several bandwidths. When calculating TNKDE, two bandwidths are necessary, one for space and one for time. In this section, we will try to explore three different bandwidth selection methods:

  • bw_bcv: implements biased cross-validation for bandwidth selection in kernel density estimation. The goal of cross-validation in this context is to find the bandwidth that minimizes the estimation error.

  • bw_ucv: implements cross-validation for bandwidth selection, but in an unbiased manner.

  • bw_SJ: implements the methods of Sheather & Jones (1991) to select the bandwidth using pilot estimation of derivatives.

To achieve this, we will follow the steps as below:

We will first create a vector w of length equal to the number of rows in the destination_sf data frame. Each element of the vector is set to 1. This vector will be used as weights in calculating TKDE.

w <- rep(1,nrow(destination_sf))

We will then create a sequence of numbers from 0 to the maximum value of the dropoff_hr_num column in the destination_sf data frame, with a step size of 0.5. These are the sample points where the density will be estimated.

samples <- seq(0, max(destination_sf$dropoff_hr_num), 0.5)

Next, we will calculate bandwidth values for dropoff_hr_num column of destination_sf using 3 bandwidth selection methods discussed above.

bw_bcv <- bw.bcv(destination_sf$dropoff_hr_num)
bw_ucv <- bw.ucv(destination_sf$dropoff_hr_num)
bw_SJ <- bw.SJ(destination_sf$dropoff_hr_num)

Once bandwidth values are calculated, we will proceed to implement TNKDE analysis using tkde() function from spPackage. Then, we will create a dataframe time_kernel_values to store these density values.

time_kernel_values <- data.frame(
  bw_bcv = tkde(destination_sf$dropoff_hr_num, w = w, samples = samples, bw = bw_bcv, kernel_name = "quartic"),
  bw_ucv = tkde(destination_sf$dropoff_hr_num, w = w, samples = samples, bw = bw_ucv, kernel_name = "quartic"),
  bw_SJ = tkde(destination_sf$dropoff_hr_num, w = w, samples = samples, bw = bw_SJ, kernel_name = "quartic"),
  time = samples
)

Next, we will use melt function from the reshape2 package to transform time_kernel_values dataframe into a new dataframe called df_time, using time as the identifier variable. We will also convert the variable column of the df_time data frame to a factor, for plotting in ggplot2.

df_time <- reshape2::melt(time_kernel_values,id.vars = "time")
df_time$variable <- as.factor(df_time$variable)

Finally, we will use ggplot2 to create a line plot of the TKDE for each bandwidth.

ggplot(data = df_time) + 
  geom_line(aes(x = time, y = value)) +
  scale_x_continuous(breaks = dropoff_hr_num) +
  facet_wrap(vars(variable), ncol=2, scales = "free")  + 
  theme(axis.text = element_text(size = 5))

Reflection

What does each line plot say about each bandwidth selection approach?

In all three plots, the line represents the estimated density of drop-offs at each time point. At a glance, bw_ucv and bw_SJ methods seem to give a more detailed and potentially noisier pattern than bw_bcv. Unlike the other two, bw_bcv approach shows a relatively smooth curve with minor fluctuations. The differences in these plots highlight how the choice of bandwidth can significantly affect the resulting density estimate.

9.5 TNKDE Implementing at Punggol Planning Area

9.5.1 Data Wrangling and Exploratory Analysis

Now, we will try to construct a TNKDE map for Punggol Planning Area. Before running TNKDE, we will carry our some preliminary data wrangling and exploratory analysis to ensure our dataset is in the appropriate format.

pg_dropoff_hr_num <- pg_destination$dropoff_hr_num

ggplot(data=pg_destination, 
                    aes(x=pg_dropoff_hr_num), bins = 24) + 
                    geom_bar() +
    scale_x_continuous(breaks = pg_dropoff_hr_num)

From the graph above, we observed noticeable peaks at 1, 6, 8, 13, and 20 hours (1 AM, 6 AM, 8 AM, 1 PM, and 8 PM respectively). These peaks suggest that these time intervals might be particularly active or important. To investigate further, we will filter out the destination points that fall within these specific hour intervals and use tmap to visualize these filtered data points.

pg_dropoff_1 <- filter(pg_destination, dropoff_hr_num == '1')
pg_dropoff_6 <- filter(pg_destination, dropoff_hr_num == '6')
pg_dropoff_8 <- filter(pg_destination, dropoff_hr_num == '8')
pg_dropoff_13 <- filter(pg_destination, dropoff_hr_num == '13')
pg_dropoff_20 <- filter(pg_destination, dropoff_hr_num == '20')
tmap_mode("view")
tm_basemap(server = "Esri.WorldTopoMap") +
tm_shape(pg_lixels)+
  tm_lines()+
tm_shape(pg_dropoff_1)+
  tm_dots(size=0.01) +
  tm_shape(pg_dropoff_6)+
  tm_dots(size=0.01) +
    tm_shape(pg_dropoff_8)+
  tm_dots(size=0.01) +
    tm_shape(pg_dropoff_13)+
  tm_dots(size=0.01) +  
  tm_shape(pg_dropoff_20)+
  tm_dots(size=0.01)

9.5.2 Bandwidth Selection

Once, we have applied visualization techniques to briefly analyse the datasets, we will proceed with implementing TKNDE for Punggol planning area. For the steps, we will repeat the procedures in Section 9.4 for this implementation.

w_pg <- rep(1,nrow(pg_destination))
samples_pg <- seq(0, max(pg_destination$dropoff_hr_num), 0.5)

bw_bcv <- bw.bcv(pg_destination$dropoff_hr_num)
bw_ucv <- bw.ucv(pg_destination$dropoff_hr_num)
bw_SJ <- bw.SJ(pg_destination$dropoff_hr_num)

time_kernel_values <- data.frame(
  bw_bcv = tkde(pg_destination$dropoff_hr_num, w = w_pg, samples = samples_pg, bw = bw_bcv, kernel_name = "quartic"),
  bw_ucv = tkde(pg_destination$dropoff_hr_num, w = w_pg, samples = samples_pg, bw = bw_ucv, kernel_name = "quartic"),
  bw_SJ = tkde(pg_destination$dropoff_hr_num, w = w_pg, samples = samples_pg, bw = bw_SJ, kernel_name = "quartic"),
  time = samples_pg
)

df_time <- reshape2::melt(time_kernel_values,id.vars = "time")
df_time$variable <- as.factor(df_time$variable)

ggplot(data = df_time) + 
  geom_line(aes(x = time, y = value)) +
  scale_x_continuous(breaks = dropoff_hr_num) +
  facet_wrap(vars(variable), ncol=2, scales = "free")  + 
  theme(axis.text = element_text(size = 5))

As we notice that the bandwidths plotted above only accounts for temporal aspect. However, both spatial and temporal dimensions are required for implementation of TNKDE. To achieve this, the network and temporal bandwidths can be selected with the leave-one-out-cross validation method (Gelb & Apparicio, 2023). To do so, we can use bws_tnkde_cv_likelihood_calc() function to select the most appropriate bandwidths for the network and time dimensions. It computes the cross-validation likelihood for various bandwidths, enabling a data-driven selection of the most suitable bandwidths for both dimensions.

cv_scores <- bws_tnkde_cv_likelihood_calc(
  bw_net_range = c(100,1000),
  bw_net_step = 100,
  bw_time_range = c(3,24),
  bw_time_step = 3,
  lines = pg_network,
  events = pg_destination,
  time_field = "dropoff_hr_num",
  w = rep(1, nrow(pg_destination)),
  kernel_name = "quartic",
  method = "discontinuous", 
  diggle_correction = FALSE,
  study_area = NULL,
  max_depth = 10,
  digits = 2,
  tol = 0.1,
  agg = 15,
  sparse=TRUE,
  grid_shape=c(1,1),
  sub_sample=1,
  verbose = FALSE,
  check = TRUE)
Reflection

This code chunk was referenced from Gelb’s implementation of TNKDE (2023) as provided in this vignette. Below is a breakdown of some arguments used in this code chunk.

  • bw_net_range = c(100,1000), bw_net_step = 100: These parameters define the range and step size for the network bandwidths that will be evaluated. The function will calculate the cross-validation likelihood for network bandwidths from 100 to 1000, in steps of 100.

  • bw_time_range = c(3,24), bw_time_step = 3: Similarly, these parameters define the range and step size for the time bandwidths that will be evaluated. The function will calculate the cross-validation likelihood for time bandwidths from 3 to 24, in steps of 3.

  • kernel_name = "quartic": This parameter specifies the kernel function to use for the calculation. In this case, the quartic kernel is used.

  • method = "discontinuous": This parameter specifies the method to use when calculating the TNKDE. The discontinuous method is used similar to what we did in NKDE.

  • diggle_correction = FALSE: This parameter specifies whether to use the correction factor for edge effect. In this case, since we are only looking at the neighborhood level, edge correction is not necessary and the correction factor is not used.

  • agg = 15: This parameter specifies a function to aggregate the events when they are too close.

We can see the outputs of bws_tnkde_cv_likelihood_calc() function in a table format using knitr package.

knitr::kable(cv_scores)
3 6 9 12 15 18 21 24
100 -250.91794 -182.71584 -155.27632 -116.78531 -113.26649 -102.39516 -93.35400 -89.81173
200 -145.76563 -90.84170 -69.02605 -61.88347 -56.56890 -49.40677 -49.57432 -49.73512
300 -100.00061 -65.40228 -47.32325 -43.88449 -40.43182 -36.94576 -37.12520 -37.29583
400 -68.95819 -41.82902 -32.92989 -31.35903 -29.73603 -29.93138 -30.11827 -30.29386
500 -54.37326 -34.65375 -29.47804 -29.75000 -28.14349 -28.34934 -28.54072 -28.71939
600 -47.21774 -34.83465 -29.69234 -29.97549 -28.37665 -28.58649 -28.77993 -28.96001
700 -36.42261 -29.55281 -28.07569 -28.36425 -28.59379 -28.80594 -29.00069 -29.18167
800 -32.88500 -27.88168 -28.25843 -28.55089 -28.79093 -29.00503 -29.20083 -29.38249
900 -31.20502 -28.04346 -28.43185 -28.72756 -28.97115 -29.18680 -29.38338 -29.56554
1000 -31.33948 -28.19916 -28.59400 -28.89218 -29.13776 -29.35455 -29.55170 -29.73421

According to the “leave one out cross validation” method, the optimal set of bandwidths is 800 metres and 6 hrs (the combination that gives the least cv score). As expected, larger bandwidths are required because the density of the events are spread both in space and time.

9.5.3 TNKDE Calculation

Now that we have selected an appropriate bandwidth for both spatial and temporal dimensions, we will proceed to implement NKDE calculation. Our first step involves selecting a sample time-step for the calculation. Given that our analysis is based on a 24-hour window, we adopted a 3-hour time-step as a simple rule-of-thumb. Then, we proceed to calculate TNKDE densities.

pg_sample_time <- seq(0, max(pg_destination$dropoff_hr_num), 3)

pg_tnkde_densities <- tnkde(lines = pg_network,
                   events = pg_destination,
                   time_field = "dropoff_hr_num",
                   w = rep(1, nrow(pg_destination)), 
                   samples_loc = pg_samples,
                   samples_time = pg_sample_time, 
                   kernel_name = "quartic",
                   bw_net = 800, bw_time = 6,
                   adaptive = TRUE,
                   trim_bw_net = 900,
                   trim_bw_time = 8,
                   method = "discontinuous",
                   div = "bw", max_depth = 10,
                   digits = 2, tol = 0.01,
                   agg = 15, grid_shape = c(1,1), 
                   verbose  = FALSE)
Reflection

This code chunk was referenced from Gelb’s implementation of TNKDE (2023) as provided in this vignette. Below is a breakdown of some arguments used in this code chunk.

  • bw_net = 800, bw_time = 6: These parameters specify the bandwidths for the network and time dimensions respectively. We use 800 and 6 respectively based on the results of “leave one out cross validation” we carried out in previous section.

  • trim_bw_net = 900, trim_bw_time = 8: These parameters specify the maximum value for the adaptive bandwidth in the network and time dimensions respectively. Here, we use simple rule-of-thumb values.

9.5.4 Creating Animated TNKDE Map

Upon completion of TNKDE densities calculation, we will visulise the results in form of animated GIF.

Firstly, we will use classInt package to create color breaks for densities values. To determine the class intervals, k-means algorithm will be used.

all_densities <- c(pg_tnkde_densities$k)
color_breaks <- classIntervals(all_densities, n = 10, style = "kmeans")

Next, we will generate respective maps for each sample time and compile them all under pg_tnkde_densities using lapply() function. Each map visualizes TNKDE densities for a specific time point. The densities are represented by dots on the map, with the color of the dots indicating the density value. We will use the color breaks and palette specified earlier.

all_maps <- lapply(1:ncol(pg_tnkde_densities$k), function(i){
  time <- pg_sample_time[[i]]
  
  pg_samples$tnkde_density <- pg_tnkde_densities$k[,i]
  map1 <- tm_shape(pg_samples) + 
  tm_dots(col = "tnkde_density", size = 0.01,
          breaks = color_breaks$brks, palette = viridis(10)) + 
    tm_layout(legend.show=FALSE, main.title = as.character(time), main.title.size = 0.5)
  return(map1)
})

Gelb (2023) suggested the use of animated maps (GIF or video) to analyse the results of a TNKDE. hence, we will create an animated GIF from the series of maps stored in all_maps and then display the GIF. To achieve this, we will use tmap_animation() function from the tmap package. Once the GIF image is created, we will use include_graphics() function from the knitr package to display the created GIF.

tmap_animation(all_maps, filename = "images/animated_pg.gif", 
               width = 1000, height = 1000, dpi = 300, delay = 50)
knitr::include_graphics("images/animated_pg.gif")

The TNKDE map for Punggol provides much more nuanced insights into the spatial clustering patterns across various time intervals. This enables a better understanding of how events are distributed and concentrated within the network over different periods. This can be useful for identification of hotspots and trends both spatially and temporally. Overall, the TNKDE approach offers a comprehensive analysis of spatial and temporal clustering within a network, providing a deeper understanding of the dynamics at play.

10.0 Conclusion

This analysis has explored different technqiues of spatial point pattern analysis using Grab’s trajectory datasets. We have carried out three different approaches - Traditional Kernel Density Estimation (KDE), Network Constrained Kernel Density Estimation (NKDE) and Temporal Network Kernel Density Estimation (TNKDE). The findings from these approaches provide valuable insights into the spatial and temporal distribution of Grab’s trajectory datasets for Singapore (and particularly the distribution of destination points).

In this study, we have experimented different bandwidth selection methods and kernel function approaches and compare the results of different combinations. We then developed 3 KDE maps using three distinct approaches: Scott’s rule of thumb for fixed bandwidth, adaptive nearest neighbor, and adaptive kernel density. These KDE maps identified six planning subzones as potential hotspots for spatial clustering of Grab destination points.

We have also carried out comparative study between KDE and NKDE, outlining the limitations of pixel-based traditional KDE against network-based NKDE. We then created NKDE maps for selected six planning areas: Woodlands, Jurong East, Jurong West, Punggol, Tampines, and Toa Payoh. By utilizing network-based NKDE, we were able to overcome these limitations and provide more accurate and insights into the spatial distribution of various factors within the selected planning areas. The results from NKDE showed significant improvements in precision and interpretability compared to traditional KDE maps.

Finally, the implementation of TNKDE with a case study in Punggol further enhanced our understanding by incorporating temporal information, allowing us to identify how spatial clusters change over different time intervals.

Further research could focus on expanding the application of TNKDE to other planning areas in Singapore. Additionally, exploring the potential of incorporating other data sources, such as demographic or socioeconomic data, could provide a more comprehensive understanding of the spatial distribution of Grab taxi origin and destination points and their temporal dynamics. This could help in identifying patterns and trends related to specific demographics or socioeconomic factors that may influence human mobility patterns.

References

  1. Baddeley, A., Rubak, E., & Turner, R. (2015). Spatial Point Patterns: Methodology and Applications with R (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b19708.

  2. Barr, C., & Schoenberg, F.P. (2010). On the Voronoi estimator for the intensity of an inhomogeneous planar Poisson process. Biometrika, 97(4), 977–984.

  3. Boots, B.N., & Getis, A. (1988). Point Pattern Analysis. Reprint. Edited by G.I. Thrall. WVU Research Repository.

  4. Cronie, O., & van Lieshout, M.N. (2018). A non-model-based approach to bandwidth selection for kernel estimators of spatial intensity functions. Biometrika, 105, 455–462.

  5. Davies, T.M., & Baddeley, A. (2018). Fast computation of spatially adaptive kernel estimates. Statistics and Computing, 28(4), 937-956.

  6. Diggle, P.J. (1985). A kernel method for smoothing point process data. Applied Statistics (Journal of the Royal Statistical Society, Series C), 34, 138–147.

  7. Floch, J.-M., Marcon, E., & Puech, F. (n.d.). Spatial distribution of points. In M.-P. de Bellefon (Ed.), Handbook of Spatial Analysis : Theory and Application with R (pp. 72–111). Insee-Eurostat.

  8. Gelb, J., Apparicio, P.: Temporal Network kernel density estimation. Geographical Analysis. 56, 62–78 (2023).

  9. Gimond (2023). Chapter 11 Point Pattern Analysis. Retrieved from https://mgimond.github.io/Spatial/index.html.

  10. González, J. A., & Moraga, P. (2022). An adaptive kernel estimator for the intensity function of spatio-temporal point processes.

  11. Kam, T. S. (2022). R for Geospatial Data Science and Analytics. Retrieved from https://r4gdsa.netlify.app.

  12. Krisp, J.M., Peters, S., Murphy, C.E., & Fan, H. (2009). Visual bandwidth selection for kernel density maps. Photogrammetrie - Fernerkundung - Geoinformation, 2009, 445–454.

  13. Loader, C. (1999). Local Regression and Likelihood. Springer, New York.

  14. O’Sullivan, D., & Wong, D.W. (2007). A surface‐based approach to measuring spatial segregation. Geographical Analysis, 39, 147–168.

  15. Pebesma, E., & Bivand, R. (2023). Spatial Data Science: With Applications in R (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.

  16. Rey, S.J., Arribas-Bel, D., & Wolf, L.J. (2023). Point Pattern Analysis. In: Geographic Data Science with python. CRC Press.

  17. Scott, D.W. (1992). Multivariate Density Estimation. Theory, Practice and Visualization. New York: Wiley.

  18. Shen, B., Xu, X., Li, J., Plaza, A., & Huang, Q. (2020). Unfolding spatial-temporal patterns of taxi trip based on an improved network kernel density estimation. ISPRS International Journal of Geo-Information, 9, 683.

  19. Wilkin, J. (2020). Geocomputation 2020-2021 Work Book. University College London. Retrieved from https://jo-wilkin.github.io/GEOG0030/coursebook/analysing-spatial-patterns-iii-point-pattern-analysis.html.

  20. Wolff, M., & Asche, H. (2009). Towards geovisual analysis of crime scenes – a 3D crime mapping approach. Advances in GIScience, 429–448.

  21. Xie, Z., & Yan, J. (2008). Kernel density estimation of traffic accidents in a network space. Computers, Environment and Urban Systems, 32, 396–406.

  22. Yuan, Y., Qiang, Y., Bin Asad, K., & Chow, T. E. (2020). Point Pattern Analysis. In J.P. Wilson (Ed.), The Geographic Information Science & Technology Body of Knowledge (1st Quarter 2020 Edition). DOI: 10.22224/gistbok/2020.1.13.